hyunwoongko committed
Commit 1a735fe · 1 Parent(s): f80e757

Update README.md

Files changed (1)
  1. README.md +8 -8
README.md CHANGED
@@ -10,7 +10,7 @@ license: apache-2.0
 # Polyglot-Ko-1.3B
 
 ## Model Description
-Polyglot-Ko is a Korean autoregressive language model made by EleutherAI polyglot team. We collected about 1.2TB Korean dataset for this work, which was done with [TUNiB](https://tunib.ai/). In addition, we used the [GPT-NeoX framework](https://github.com/EleutherAI/gpt-neox) for model training and added several Korean tasks to [LM-Evaluation-Harness](https://github.com/EleutherAI/lm-evaluation-harness) for model evaluation.
+Polyglot-Ko is a series of large-scale Korean autoregressive language models made by the EleutherAI polyglot team. Polyglot-Ko-1.3B is the first and the smallest one.
 
 | Hyperparameter       | Value |
 |----------------------|----------------------------------------------------------------------------------------------------------------------------------------|
@@ -21,13 +21,13 @@ Polyglot-Ko is a Korean autoregressive language model made by EleutherAI polyglo
 | \\(n_{heads}\\)      | 16 |
 | \\(d_{head}\\)       | 128 |
 | \\(n_{ctx}\\)        | 2048 |
-| \\(n_{vocab}\\)      | 30,000 / 30,080 |
+| \\(n_{vocab}\\)      | 30,003 / 30,080 |
 | Positional Encoding  | [Rotary Position Embedding (RoPE)](https://arxiv.org/abs/2104.09864) |
 | RoPE Dimensions      | [64](https://github.com/kingoflolz/mesh-transformer-jax/blob/f2aa66e0925de6593dcbb70e72399b97b4130482/mesh_transformer/layers.py#L223) |
 
 The model consists of 24 transformer layers with a model dimension of 2048, and a feedforward dimension of 8192. The model
 dimension is split into 16 heads, each with a dimension of 128. Rotary Position Embedding (RoPE) is applied to 64
-dimensions of each head. The model is trained with a tokenization vocabulary of 30000.
+dimensions of each head. The model is trained with a tokenization vocabulary of 30003.
 
 ## Training data
 
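As a quick sanity check on the numbers in the hunk above, the released checkpoint can be inspected directly with `transformers`. This is a minimal sketch, assuming the model is hosted on the Hugging Face Hub as `EleutherAI/polyglot-ko-1.3b` and loads as a standard GPT-NeoX causal LM; the attribute names below are the stock `GPTNeoXConfig` fields, not values taken from this diff.

```python
# Minimal sketch (not part of the README): verify the architecture hyperparameters listed above.
# Assumes the checkpoint id "EleutherAI/polyglot-ko-1.3b"; adjust if the repository name differs.
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

config = AutoConfig.from_pretrained("EleutherAI/polyglot-ko-1.3b")
print(config.num_hidden_layers)    # expected 24 transformer layers
print(config.hidden_size)          # expected model dimension 2048
print(config.intermediate_size)    # expected feedforward dimension 8192
print(config.num_attention_heads)  # expected 16 heads, so d_head = 2048 / 16 = 128
print(config.vocab_size)           # expected 30,080 (embeddings padded beyond the 30,003-entry tokenizer vocabulary)

# The checkpoint is an ordinary causal LM, so generation works the usual way:
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/polyglot-ko-1.3b")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/polyglot-ko-1.3b")
inputs = tokenizer("한국어 언어 모델은", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0]))
```

If the repository ships a different architecture class, the same values can be read from its `config.json` instead.
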
@@ -60,19 +60,19 @@ General training algorithms for pretrained language model have many hazards that
 * `<|tell|>` : phone number
 
 ### Limitations and Biases
-
 The core functionality of Polyglot-Ko is taking a string of text and predicting the next token. While language models are widely used for tasks other than this, there are a lot of unknowns with this work. When prompting Polyglot-Ko, it is important to remember that the statistically most likely next token is often not the token that produces the most "accurate" text. Never depend upon Polyglot-Ko to produce factually accurate output. Depending upon the use case, Polyglot-Ko may produce socially unacceptable text.
-
 As with all language models, it is hard to predict in advance how Polyglot-Ko will respond to particular prompts, and offensive content may occur without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results.
 
 ### Legal Restrictions
 Since there are laws in many countries related to data collection, we will collect data with due regard to the laws of those countries.
 Additionally, we plan to use the dataset to train our models, but we do not plan to make the dataset publicly available.
 
+
 ## Evaluation results
-We used the [KOBEST dataset](https://arxiv.org/abs/2204.04541), which consists of five Korean downstream tasks for model evaluation.
-We added the corresponding tasks to [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) and utilized prompt templates described in the paper.
-The following tables show the evaluation results with the various number of few-shot examples. You can reproduce these results using [polyglot branch of lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/polyglot) and the following scripts.
+We used the [KOBEST dataset](https://arxiv.org/abs/2204.04541), which consists of five Korean downstream tasks, for evaluation.
+We added those tasks to [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) and utilized the prompt templates described in the paper.
+We evaluated our model as well as two other Korean language models, i.e., skt/ko-gpt-trinity-1.2B-v0.5 and kakaobrain/kogpt, for comparison.
+The following tables show the results for different numbers of few-shot examples. You can reproduce these results using the [polyglot branch of lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/polyglot) and the following scripts.
 
 ```console
 python main.py \