hyunwoongko committed
Commit 1a735fe · 1 Parent(s): f80e757

Update README.md

Files changed (1)
  1. README.md +8 -8
README.md CHANGED
@@ -10,7 +10,7 @@ license: apache-2.0
 # Polyglot-Ko-1.3B
 
 ## Model Description
-Polyglot-Ko is a Korean autoregressive language model made by EleutherAI polyglot team. We collected about 1.2TB Korean dataset for this work, which was done with [TUNiB](https://tunib.ai/). In addition, we used the [GPT-NeoX framework](https://github.com/EleutherAI/gpt-neox) for model training and added several Korean tasks to [LM-Evaluation-Harness](https://github.com/EleutherAI/lm-evaluation-harness) for model evaluation.
+Polyglot-Ko is a series of large-scale Korean autoregressive language models made by the EleutherAI polyglot team. Polyglot-Ko-1.3B is the first and the smallest one.
 
 | Hyperparameter       | Value |
 |----------------------|----------------------------------------------------------------------------------------------------------------------------------------|
@@ -21,13 +21,13 @@ Polyglot-Ko is a Korean autoregressive language model made by EleutherAI polyglo
 | \\(n_{heads}\\)      | 16 |
 | \\(d_{head}\\)       | 128 |
 | \\(n_{ctx}\\)        | 2048 |
-| \\(n_{vocab}\\)      | 30,000 / 30,080 |
+| \\(n_{vocab}\\)      | 30,003 / 30,080 |
 | Positional Encoding  | [Rotary Position Embedding (RoPE)](https://arxiv.org/abs/2104.09864) |
 | RoPE Dimensions      | [64](https://github.com/kingoflolz/mesh-transformer-jax/blob/f2aa66e0925de6593dcbb70e72399b97b4130482/mesh_transformer/layers.py#L223) |
 
 The model consists of 24 transformer layers with a model dimension of 2048, and a feedforward dimension of 8192. The model
 dimension is split into 16 heads, each with a dimension of 128. Rotary Position Embedding (RoPE) is applied to 64
-dimensions of each head. The model is trained with a tokenization vocabulary of 30000.
+dimensions of each head. The model is trained with a tokenization vocabulary of 30003.
 
 ## Training data
 
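As a quick sanity check on the numbers in the hunk above, the released checkpoint can be inspected directly with `transformers`. This is a minimal sketch, assuming the model is hosted on the Hugging Face Hub as `EleutherAI/polyglot-ko-1.3b` and loads as a standard GPT-NeoX causal LM; the attribute names below are the stock `GPTNeoXConfig` fields, not values taken from this diff.

```python
# Minimal sketch (not part of the README): verify the architecture hyperparameters listed above.
# Assumes the checkpoint id "EleutherAI/polyglot-ko-1.3b"; adjust if the repository name differs.
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

config = AutoConfig.from_pretrained("EleutherAI/polyglot-ko-1.3b")
print(config.num_hidden_layers)    # expected 24 transformer layers
print(config.hidden_size)          # expected model dimension 2048
print(config.intermediate_size)    # expected feedforward dimension 8192
print(config.num_attention_heads)  # expected 16 heads, so d_head = 2048 / 16 = 128
print(config.vocab_size)           # expected 30,080 (embeddings padded beyond the 30,003-entry tokenizer vocabulary)

# The checkpoint is an ordinary causal LM, so generation works the usual way:
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/polyglot-ko-1.3b")
model = AutoModelForCausalLM.from_pretrained("EleutherAI/polyglot-ko-1.3b")
inputs = tokenizer("한국어 언어 모델은", return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32, do_sample=True, top_p=0.9)
print(tokenizer.decode(output_ids[0]))
```

If the repository ships a different architecture class, the same values can be read from its `config.json` instead.
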
@@ -60,19 +60,19 @@ General training algorithms for pretrained language model have many hazards that
 * `<|tell|>` : phone number
 
 ### Limitations and Biases
-
 The core functionality of Polyglot-Ko is taking a string of text and predicting the next token. While language models are widely used for tasks other than this, there are a lot of unknowns with this work. When prompting Polyglot-Ko, it is important to remember that the statistically most likely next token is often not the token that produces the most "accurate" text. Never depend upon Polyglot-Ko to produce factually accurate output. Depending upon the use case, Polyglot-Ko may produce socially unacceptable text.
-
 As with all language models, it is hard to predict in advance how Polyglot-Ko will respond to particular prompts, and offensive content may occur without warning. We recommend having a human curate or filter the outputs before releasing them, both to censor undesirable content and to improve the quality of the results.
 
 ### Legal Restrictions
 Since there are laws in many countries related to data collection, we will collect data with due regard to the laws of those countries.
 Additionally, we plan to use the dataset to train our models, but we do not plan to make the dataset publicly available.
 
+
 ## Evaluation results
-We used the [KOBEST dataset](https://arxiv.org/abs/2204.04541), which consists of five Korean downstream tasks for model evaluation.
-We added the corresponding tasks to [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) and utilized prompt templates described in the paper.
-The following tables show the evaluation results with the various number of few-shot examples. You can reproduce these results using [polyglot branch of lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/polyglot) and the following scripts.
+We used the [KOBEST dataset](https://arxiv.org/abs/2204.04541), which consists of five Korean downstream tasks, for evaluation.
+We added those tasks to [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) and utilized the prompt templates described in the paper.
+We evaluated our model as well as two other Korean language models, i.e., skt/ko-gpt-trinity-1.2B-v0.5 and kakaobrain/kogpt, for comparison.
+The following tables show the results for different numbers of few-shot examples. You can reproduce these results using the [polyglot branch of lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness/tree/polyglot) and the following scripts.
 
 ```console
 python main.py \