Model Card
Hi! 👋
This PR adds some additional information to the model card, based on the format we are using as part of our effort to standardise model cards at Hugging Face. Feel free to merge if you are ok with the changes! (cc
@Marissa
@Meg
)
README.md
CHANGED
@@ -12,24 +12,81 @@ license: gpl-3.0
 
 # CKIP ALBERT Tiny Chinese
 
+## Table of Contents
+- [Model Details](#model-details)
+- [Uses](#uses)
+- [Risks, Limitations and Biases](#risks-limitations-and-biases)
+- [Training](#training)
+- [Evaluation](#evaluation)
+- [How to Get Started With the Model](#how-to-get-started-with-the-model)
+
+## Model Details
+- **Model Description:**
+
 This project provides traditional Chinese transformers models (including ALBERT, BERT, GPT2) and NLP tools (including word segmentation, part-of-speech tagging, named entity recognition).
 
 這個專案提供了繁體中文的 transformers 模型(包含 ALBERT、BERT、GPT2)及自然語言處理工具(包含斷詞、詞性標記、實體辨識)。
 
-
-
-- https://github.com/ckiplab/ckip-transformers
-
-## Contributers
-
-
-
-
-
-
-
+- **Developed by:** [Mu Yang](https://muyang.pro) at [CKIP](https://ckip.iis.sinica.edu.tw)
+- **Model Type:** Fill-Mask
+- **Language(s):** Chinese
+- **License:** gpl-3.0
+- **Parent Model:** See the [ALBERT base model](https://huggingface.co/albert-base-v2) for more information.
+- **Resources for more information:**
+  - [GitHub Repo](https://github.com/ckiplab/ckip-transformers)
+  - [CKIP Documentation](https://ckip-transformers.readthedocs.io/en/stable/)
+
+## Uses
+
+#### Direct Use
+
+The model author suggests using BertTokenizerFast as the tokenizer instead of AutoTokenizer.
+
 請使用 BertTokenizerFast 而非 AutoTokenizer。
 
+For full usage and more information, please refer to the [GitHub repository](https://github.com/ckiplab/ckip-transformers).
+
+有關完整使用方法及其他資訊,請參見 [GitHub repository](https://github.com/ckiplab/ckip-transformers)。
+
+## Risks, Limitations and Biases
+**CONTENT WARNING: Readers should be aware this section contains content that is disturbing, offensive, and can propagate historical and current stereotypes.**
+
+Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)).
+
+## Training
+
+#### Training Data
+
+The language models are trained on the ZhWiki and CNA datasets; the WS and POS tasks are trained on the ASBC dataset; the NER tasks are trained on the OntoNotes dataset.
+
+以上的語言模型訓練於 ZhWiki 與 CNA 資料集上;斷詞(WS)與詞性標記(POS)任務模型訓練於 ASBC 資料集上;實體辨識(NER)任務模型訓練於 OntoNotes 資料集上。
+
+#### Training Procedure
+* **Parameters:** 4M
+
+## Evaluation
+
+#### Results
+
+* **Perplexity:** 4.40
+* **WS (Word Segmentation) [F1]:** 96.66%
+* **POS (Part-of-speech) [ACC]:** 94.48%
+* **NER (Named-entity recognition) [F1]:** 71.17%
+
+## How to Get Started With the Model
+
 ```
 from transformers import (
     BertTokenizerFast,
@@ -40,6 +97,4 @@ tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')
 model = AutoModel.from_pretrained('ckiplab/albert-tiny-chinese')
 ```
 
-For full usage and more information, please refer to https://github.com/ckiplab/ckip-transformers.
 
-有關完整使用方法及其他資訊,請參見 https://github.com/ckiplab/ckip-transformers 。
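To sanity-check the getting-started snippet, mask filling can be run end to end with the standard transformers pipeline. This is a minimal sketch, assuming the checkpoint loads under AutoModelForMaskedLM (the card itself only shows AutoModel); the example sentence is invented for illustration:

```
from transformers import BertTokenizerFast, AutoModelForMaskedLM, pipeline

# Pair the model with BertTokenizerFast, as the card recommends,
# rather than AutoTokenizer.
tokenizer = BertTokenizerFast.from_pretrained('bert-base-chinese')

# Assumption: the checkpoint ships masked-LM weights, so it can be loaded
# with AutoModelForMaskedLM instead of the bare AutoModel shown above.
model = AutoModelForMaskedLM.from_pretrained('ckiplab/albert-tiny-chinese')

fill_mask = pipeline('fill-mask', model=model, tokenizer=tokenizer)

# Invented example: 「巴黎是法國的首都。」 with the final character masked.
for prediction in fill_mask('巴黎是法國的首[MASK]。'):
    print(prediction['token_str'], round(prediction['score'], 4))
```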
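For the WS, POS and NER task models mentioned under Training Data, the repository documents dedicated driver classes. The following is a sketch based on the ckip-transformers README; treat the `model` keyword and the call signatures as assumptions to verify against the repository docs:

```
from ckip_transformers.nlp import CkipWordSegmenter, CkipPosTagger, CkipNerChunker

# Driver classes from the ckip-transformers package; `model="albert-tiny"`
# selects the ALBERT-tiny task checkpoints (an assumption - check the repo).
ws_driver = CkipWordSegmenter(model="albert-tiny")
pos_driver = CkipPosTagger(model="albert-tiny")
ner_driver = CkipNerChunker(model="albert-tiny")

text = ['中央研究院資訊科學研究所位於台北。']
ws = ws_driver(text)    # word segmentation: one token list per input string
pos = pos_driver(ws)    # POS tags, aligned with the segmented tokens
ner = ner_driver(text)  # named-entity chunks with spans and labels

print(ws, pos, ner, sep='\n')
```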