alea-institute committed on
Commit 3c1b96d · verified · 1 Parent(s): c48ada8

Upload folder using huggingface_hub

Files changed (5)
  1. README.md +138 -0
  2. merges.txt +0 -0
  3. tokenizer.json +0 -0
  4. tokenizer_config.json +1 -0
  5. vocab.json +0 -0
README.md ADDED
@@ -0,0 +1,138 @@
+ ---
+ language:
+ - en
+ - es
+ - fr
+ - de
+ library_name: tokenizers
+ license: cc-by-4.0
+ tags:
+ - kl3m
+ - kl3m-004
+ - alea
+ - legal
+ - financial
+ date: '2024-12-30T00:00:00.000Z'
+ ---
+
+ # kl3m-004-char-16k-cased
+
+ The `kl3m-004-char-16k-cased` **case-sensitive** tokenizer is a domain-specific **character-based** tokenizer trained
+ on a stratified sample of nearly 2M documents across general, legal, and financial domains from the `kl3m-data` project,
+ including American English, British English, Spanish, German, French, Italian, and other common EU languages.
+
+ This tokenizer uses the standard Byte-Pair Encoding (BPE) tokenizer from `tokenizers`/`transformers`, but modifies the
+ training process to restrict the vocabulary to tokens that are at most 4 characters long. Models trained with this tokenizer
+ should be able to handle a number of use cases that are otherwise difficult for standard tokenizers, such as
+ low-resource spell-checking, OCR correction, whitespace normalization, and other tasks that require a high degree of character-level
+ granularity.
+
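+ The sketch below is a minimal illustration of this character-level behavior (not part of the training or release pipeline): it loads the published tokenizer from the Hugging Face Hub with the `tokenizers` library and prints the pieces produced for an arbitrary sample string; the exact pieces depend on the learned merges.
+
+ ```python
+ from tokenizers import Tokenizer
+
+ # Load the published tokenizer from the Hugging Face Hub.
+ tokenizer = Tokenizer.from_pretrained("alea-institute/kl3m-004-char-16k-cased")
+
+ # Encode a sample string; each learned (non-special) token covers at most
+ # four characters of text, so even long words are split into short pieces.
+ encoding = tokenizer.encode("Indemnification obligations survive termination.")
+ print(encoding.tokens)
+ print(encoding.ids)
+ ```
+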
+ ## Model Details
+
+ ### Summary
+
+ - **Vocabulary:** 16,384 tokens
+ - **Tokenizer type:** BPE with 1-4 character tokens
+ - **Special token support:** Both causal and masked language modeling
+ - **Language(s) (NLP):** Primarily English, Spanish, German, and French, with a small percentage of other EU languages.
+ - **Data Sources:** See the [`kl3m-data`](https://github.com/alea-institute/kl3m-data) repository.
+ - **Developed by:** [ALEA Institute](https://aleainstitute.ai).
+ - **License:** [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/)
+
+ For more information about the `kl3m-004` tokenizers, see the [kl3m-004-128k-cased tokenizer](https://huggingface.co/alea-institute/kl3m-004-128k-cased).
+
+ #### Special Tokens for both Embedding and Generative Models
+
+ For both training and inference efficiency, we intended this tokenizer vocabulary to be
+ usable for both embedding and generative models. As such, we included special tokens
+ suitable for both causal and masked language modeling tasks:
+
+ * `<|start|>`: `0`
+ * `<|end|>`: `1`
+ * `<|pad|>`: `2`
+ * `<|unk|>`: `3`
+ * `<|sep|>`: `4`
+ * `<|cls|>`: `5`
+ * `<|mask|>`: `6`
+
+ We also added a number of chat and instruction tokens that were not included in `kl3m-001-32k`, including:
+
+ * `<|system|>`: `7`
+ * `</|system|>`: `8`
+ * `<|user|>`: `9`
+ * `</|user|>`: `10`
+ * `<|instruction|>`: `11`
+ * `</|instruction|>`: `12`
+
+ These tokens are identical to those used in the `kl3m-003-64k` tokenizer.
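+
+ The identifiers listed above can be checked directly against the released tokenizer files; a minimal sketch using the `tokenizers` library (illustrative only):
+
+ ```python
+ from tokenizers import Tokenizer
+
+ tokenizer = Tokenizer.from_pretrained("alea-institute/kl3m-004-char-16k-cased")
+
+ # Look up a few of the special tokens listed above by their string form.
+ for token in ["<|start|>", "<|end|>", "<|pad|>", "<|mask|>", "<|system|>"]:
+     print(token, tokenizer.token_to_id(token))
+ ```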
68
+
69
+ ### Replication
70
+
71
+ The entire data collection and preprocesing pipeline is being made available, along with
72
+ training data, as part of the [ALEA Institute](https://aleainstitute.ai) [KL3M project](https://aleainstitute.ai/work/kl3m/).
73
+
74
+ The source code to used to train the tokenizer is available on GitHub at:
75
+ [https://github.com/alea-institute/kl3m-embedding-research](https://github.com/alea-institute/kl3m-embedding-research)
76
+
77
+ The data pipeline will be available on GitHub and S3 in the near future.
78
+
+ This specific tokenizer was trained using the following command:
+
+ ```bash
+ PYTHONPATH=. poetry run python3 \
+ kl3m_tokenizers/tokenizers/kl3m_004/train_char_tokenizer.py \
+ --min_frequency 1000 \
+ --vocab_size 16384 \
+ --pad2 \
+ --max_chars 4 \
+ sample.20241223173012.jsonl.gz \
+ ./kl3m-004-char-16k-cased/
+ ```
+
+ ```text
+ Training tokenizer.
+ [00:33:12] Pre-processing sequences ██████████████████████████████ 1849344 / 0
+ [00:33:32] Pre-processing sequences ██████████████████████████████ 0 / 0
+ [00:00:21] Tokenize words ██████████████████████████████ 20286360 / 20286360
+ [00:01:01] Count pairs ██████████████████████████████ 20286360 / 20286360
+ [00:12:39] Compute merges ██████████████████████████████ 16036 / 16036
+ Adding power-of-2 padding tokens.
+ Padded vocab to 16384 tokens.
+ Special tokens: 13
+ Power-of-2 pad tokens: 13
+ Final vocab size: 16384
+ Training time: 2863.67 seconds
+ Output path: kl3m-004-char-16k-cased
+ ```
+
+ ### Uses
+ This tokenizer is intended for English, Spanish, German, or French language tasks where
+ character-level details are important, such as OCR correction, spell-checking, or tasks where word boundaries
+ are not well-defined.
+
+ For a standard BPE "word" tokenizer with a larger vocabulary size, consider using the `kl3m-004-128k-cased` or
+ `kl3m-004-128k-uncased` tokenizers.
+
+ ### Recommendations
+ The `kl3m-004-char-16k-cased` tokenizer may be particularly useful when character-level details are important but
+ resource constraints are less severe than those targeted by the smaller character tokenizers. For smaller vocabularies
+ with better resource efficiency, consider the `kl3m-004-char-4k-cased` or `kl3m-004-char-8k-cased` tokenizers.
+
+ ### How to Get Started with the Model
+ Use the code below to get started with the model.
+
+ ```python
+ from tokenizers import Tokenizer
+
+ tokenizer = Tokenizer.from_pretrained('alea-institute/kl3m-004-char-16k-cased')
+ ```
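+
+ Because the repository also ships a `tokenizer_config.json` declaring `PreTrainedTokenizerFast`, the tokenizer should likewise be loadable through `transformers`; the snippet below is a minimal sketch under that assumption, using an arbitrary sample string.
+
+ ```python
+ from transformers import AutoTokenizer
+
+ # Assumes the Hub repository resolves through AutoTokenizer via the bundled
+ # tokenizer.json / tokenizer_config.json.
+ tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-004-char-16k-cased")
+
+ text = "Character-level tokens help with noisy input like 'c0ntract'."
+ ids = tokenizer(text)["input_ids"]
+ print(ids)
+ print(tokenizer.convert_ids_to_tokens(ids))
+ print(tokenizer.decode(ids))
+ ```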
+
+ ### Citation
+ Tokenizer and dataset publications are pending.
+
+ ## Contact
+
+ For any questions, please contact [ALEA Institute](https://aleainstitute.ai) at [[email protected]](mailto:[email protected]) or
+ create an issue on this repository or [GitHub](https://github.com/alea-institute/kl3m-embedding-research).
+
+ ![logo](https://aleainstitute.ai/images/alea-logo-ascii-1x1.png)
merges.txt ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1 @@
+ {"unk_token": "<|unk|>", "bos_token": "<|start|>", "eos_token": "<|end|>", "pad_token": "<|pad|>", "sep_token": "<|sep|>", "cls_token": "<|cls|>", "mask_token": "<|mask|>", "add_prefix_space": false, "do_lower_case": false, "tokenizer_class": "PreTrainedTokenizerFast"}
vocab.json ADDED
The diff for this file is too large to render. See raw diff