Commit 2d84e3f (1 parent: 20746b1)
update model card

README.md CHANGED
```diff
@@ -2,9 +2,9 @@
 language:
 - en
 library_name: fasttext
+pipeline_tag: text-classification
 tags:
 - text
-- text-classification
 - semantic-similarity
 - earnings-call-transcripts
 - word2vec
@@ -20,4 +20,72 @@ widget:
   example_title: "disruption"
 ---
-
```

# EarningsCall2Vec

This is a [fastText](https://fasttext.cc/) model trained via [`Gensim`](https://radimrehurek.com/gensim/): it maps each token in the vocabulary (i.e., unigrams and frequently co-occurring bi-, tri-, and four-grams) to a dense, 300-dimensional vector space, designed for **semantic search**. It was trained on a corpus of ~160k earnings call transcripts, specifically the executive remarks within the Q&A sections of these transcripts (~13M sentences).

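Semantic search over such an embedding space amounts to nearest-neighbour lookup by cosine similarity. A toy sketch with hypothetical 3-dimensional vectors (the model itself uses 300 dimensions, and these tokens and values are made up for illustration):

```python
import math

# Toy embeddings standing in for the model's 300-d token vectors.
embeddings = {
    "disruption": [0.9, 0.1, 0.2],
    "headwind":   [0.8, 0.2, 0.3],
    "revenue":    [0.1, 0.9, 0.4],
}

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def most_similar(query, topn=2):
    # Rank all other tokens by cosine similarity to the query token.
    q = embeddings[query]
    scores = {t: cosine(q, v) for t, v in embeddings.items() if t != query}
    return sorted(scores.items(), key=lambda kv: -kv[1])[:topn]

print(most_similar("disruption"))  # "headwind" ranks first
```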
## Usage (API)

```
pip install -U xxx
```

Then you can use the model like this:

```python
py code
```

## Usage (Gensim)

```
pip install -U xxx
```

Then you can use the model like this:

```python
# Load the model from Facebook fastText binary format (<PATH_MOD_SAVE>,
# as produced by the training snippet below) and query the nearest
# semantic neighbours of a token.
from gensim.models.fasttext import load_facebook_model

model = load_facebook_model(<PATH_MOD_SAVE>)
model.wv.most_similar("disruption", topn=5)
```

## Background

Context on the project.

## Intended Uses

Our model is intended for token-level semantic search: it encodes a search query (i.e., a token) in a dense vector space and retrieves its semantic neighbours, i.e., tokens that frequently occur within similar contexts in the underlying training data. Note that this search is only feasible for individual tokens and may produce deficient results for out-of-vocabulary tokens.

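fastText builds a token's vector by summing the vectors of its character n-grams, which is why even out-of-vocabulary tokens receive a (possibly noisy) vector. A minimal sketch of that n-gram decomposition, assuming fastText's default 3–6 n-gram range:

```python
# fastText pads each token with boundary markers '<' and '>' and then
# extracts all character n-grams between nmin and nmax; the vectors of
# these n-grams are summed to form the token vector, so OOV tokens can
# still be embedded from their subword pieces.
def char_ngrams(token, nmin=3, nmax=6):
    s = f"<{token}>"
    return [s[i:i + n] for n in range(nmin, nmax + 1) for i in range(len(s) - n + 1)]

print(char_ngrams("cash", 3, 4))
```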
## Training procedure

```python
import logging

from gensim.models import FastText
from gensim.models.fasttext import save_facebook_model
from gensim.models.word2vec import LineSentence

logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# init
model = FastText(
    vector_size=300,
    window=5,
    min_count=10,
    alpha=0.025,
    negative=5,
    seed=2021,
    sample=0.001,
    sg=1,
    hs=0,
    max_vocab_size=None,
    workers=10,
)

# build vocab
model.build_vocab(corpus_iterable=LineSentence(<PATH_TRAIN_DATA>))

# train model
model.train(
    corpus_iterable=LineSentence(<PATH_TRAIN_DATA>),
    total_words=model.corpus_total_words,
    total_examples=model.corpus_count,
    epochs=50,
)

# save to binary (Facebook fastText) format
save_facebook_model(model, <PATH_MOD_SAVE>)
```

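The `LineSentence` iterator used above expects a plain-text file with one sentence per line and tokens separated by whitespace. A minimal sketch of preparing such a file (the file name and sentences here are hypothetical):

```python
# Write pre-tokenised sentences in the one-sentence-per-line,
# whitespace-separated format that gensim's LineSentence reads.
sentences = [
    ["we", "saw", "some", "disruption", "in", "the", "supply", "chain"],
    ["revenue", "growth", "remained", "strong", "this", "quarter"],
]
with open("train.txt", "w") as f:
    for tokens in sentences:
        f.write(" ".join(tokens) + "\n")
```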
## Training Data

description
|