Upload folder using huggingface_hub
- README.md +138 -0
- merges.txt +0 -0
- tokenizer.json +0 -0
- tokenizer_config.json +1 -0
- vocab.json +0 -0
README.md
ADDED
@@ -0,0 +1,138 @@
---
language:
- en
- es
- fr
- de
library_name: tokenizers
license: cc-by-4.0
tags:
- kl3m
- kl3m-004
- alea
- legal
- financial
date: '2024-12-30T00:00:00.000Z'
---

# kl3m-004-char-16k-cased

The `kl3m-004-char-16k-cased` **case-sensitive** tokenizer is a domain-specific **character-based** tokenizer trained
on a stratified sample of nearly 2M documents across general, legal, and financial domains from the `kl3m-data` project,
including American English, British English, Spanish, German, French, Italian, and other common EU languages.

This tokenizer uses the standard Byte-Pair Encoding (BPE) tokenizer from `tokenizers`/`transformers`, but modifies the
training process to restrict the vocabulary to tokens that are at most four characters long. Models trained with this
tokenizer should be able to handle a number of use cases that are difficult for standard "word" tokenizers, such as
low-resource spell-checking, OCR correction, whitespace normalization, and other tasks that require a high degree of
character-level granularity.

## Model Details

### Summary

- **Vocabulary:** 16,384 tokens
- **Tokenizer type:** BPE with 1-4 character tokens
- **Special token support:** Both causal and masked language modeling
- **Language(s) (NLP):** Primarily English, Spanish, German, and French, with a small percentage of other EU languages
- **Data sources:** See the [`kl3m-data`](https://github.com/alea-institute/kl3m-data) repository
- **Developed by:** [ALEA Institute](https://aleainstitute.ai)
- **License:** [CC-BY 4.0](https://creativecommons.org/licenses/by/4.0/)

For more information about the `kl3m-004` tokenizers, see the [kl3m-004-128k-cased tokenizer](https://huggingface.co/alea-institute/kl3m-004-128k-cased).

#### Special Tokens for both Embedding and Generative Models

For both training and inference efficiency, we intended this tokenizer vocabulary to be
usable for both embedding and generative models. As such, we included special tokens
suitable for both causal and masked language modeling tasks.

* `<|start|>`: `0`
* `<|end|>`: `1`
* `<|pad|>`: `2`
* `<|unk|>`: `3`
* `<|sep|>`: `4`
* `<|cls|>`: `5`
* `<|mask|>`: `6`

We also added a number of chat and instruction tokens that were not included in `kl3m-001-32k`, including:

* `<|system|>`: `7`
* `</|system|>`: `8`
* `<|user|>`: `9`
* `</|user|>`: `10`
* `<|instruction|>`: `11`
* `</|instruction|>`: `12`

These tokens are identical to those used in the `kl3m-003-64k` tokenizer.

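As a quick sanity check, the special-token IDs listed above can be read back from the published tokenizer with the `tokenizers` API. A minimal sketch (the lookup loop below is illustrative, not part of the KL3M codebase):

```python
from tokenizers import Tokenizer

# Load the published tokenizer from the Hugging Face Hub.
tokenizer = Tokenizer.from_pretrained("alea-institute/kl3m-004-char-16k-cased")

# Look up a few of the special tokens documented above.
for token in ["<|start|>", "<|end|>", "<|pad|>", "<|mask|>", "<|system|>"]:
    print(token, tokenizer.token_to_id(token))
# Expected IDs per the list above: 0, 1, 2, 6, 7
```
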
### Replication

The entire data collection and preprocessing pipeline is being made available, along with
training data, as part of the [ALEA Institute](https://aleainstitute.ai) [KL3M project](https://aleainstitute.ai/work/kl3m/).

The source code used to train the tokenizer is available on GitHub at:
[https://github.com/alea-institute/kl3m-embedding-research](https://github.com/alea-institute/kl3m-embedding-research)

The data pipeline will be available on GitHub and S3 in the near future.

This specific tokenizer was trained using the following command:

```bash
PYTHONPATH=. poetry run python3 \
  kl3m_tokenizers/tokenizers/kl3m_004/train_char_tokenizer.py \
  --min_frequency 1000 \
  --vocab_size 16384 \
  --pad2 \
  --max_chars 4 \
  sample.20241223173012.jsonl.gz \
  ./kl3m-004-char-16k-cased/
```

```text
Training tokenizer.
[00:33:12] Pre-processing sequences      1849344 /        0
[00:33:32] Pre-processing sequences            0 /        0
[00:00:21] Tokenize words               20286360 / 20286360
[00:01:01] Count pairs                  20286360 / 20286360
[00:12:39] Compute merges                  16036 /    16036
Adding power-of-2 padding tokens.
Padded vocab to 16384 tokens.
Special tokens: 13
Power-of-2 pad tokens: 13
Final vocab size: 16384
Training time: 2863.67 seconds
Output path: kl3m-004-char-16k-cased
```

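The padding counts reported in the log are internally consistent: the `--pad2` option pads the vocabulary up to the next power of two. A small arithmetic sketch (the helper below is ours, not taken from the training script):

```python
# Illustrative check of the power-of-2 padding arithmetic reported above.
def next_power_of_two(n: int) -> int:
    p = 1
    while p < n:
        p *= 2
    return p

vocab_before_padding = 16_384 - 13          # everything except the 13 power-of-2 pad tokens
padded_size = next_power_of_two(vocab_before_padding)
print(padded_size)                          # 16384, matching "Final vocab size"
print(padded_size - vocab_before_padding)   # 13, matching "Power-of-2 pad tokens"
```
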
### Uses

This tokenizer is intended to be used for English, Spanish, German, or French language tasks where
character-level details are important, such as OCR correction, spell-checking, or tasks where word boundaries
are not well-defined.

For a standard BPE "word" tokenizer with a larger vocabulary size, consider using the `kl3m-004-128k-cased` or
`kl3m-004-128k-uncased` tokenizers.

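To illustrate the character-level granularity, the sketch below encodes a noisy, OCR-like string (the input text is ours; the exact token splits depend on the learned merges, but every token is at most four characters long):

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("alea-institute/kl3m-004-char-16k-cased")

# Character-level granularity keeps noisy inputs (OCR errors, typos) close to
# their clean counterparts in token space.
noisy = "Thc Securit1es and Exchange Comm1ssion"
encoding = tokenizer.encode(noisy)
print(encoding.tokens)  # short 1-4 character pieces; exact splits depend on the learned merges
print(encoding.ids)
```
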
### Recommendations

The `kl3m-004-char-16k-cased` tokenizer may be particularly useful when character-level details are important and
resource constraints are less severe. For smaller vocabularies with better resource efficiency, consider the
`kl3m-004-char-4k-cased` or `kl3m-004-char-8k-cased` tokenizers.

### How to Get Started with the Model

Use the code below to get started with the model.

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained('alea-institute/kl3m-004-char-16k-cased')
```

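Because this repository also ships `tokenizer.json` and a `tokenizer_config.json` declaring `PreTrainedTokenizerFast`, loading through `transformers` should also work. A minimal sketch:

```python
from transformers import AutoTokenizer

# Loads the fast tokenizer from tokenizer.json / tokenizer_config.json.
tokenizer = AutoTokenizer.from_pretrained("alea-institute/kl3m-004-char-16k-cased")
print(tokenizer.tokenize("OCR corr3ction test"))  # 1-4 character pieces
```
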
### Citation

Tokenizer and dataset publications are pending.

## Contact

For any questions, please contact the [ALEA Institute](https://aleainstitute.ai) at [[email protected]](mailto:[email protected]) or
create an issue on this repository or on [GitHub](https://github.com/alea-institute/kl3m-embedding-research).

![logo](https://aleainstitute.ai/images/alea-logo-ascii-1x1.png)
merges.txt
ADDED
The diff for this file is too large to render. See raw diff.

tokenizer.json
ADDED
The diff for this file is too large to render. See raw diff.

tokenizer_config.json
ADDED
@@ -0,0 +1 @@
{"unk_token": "<|unk|>", "bos_token": "<|start|>", "eos_token": "<|end|>", "pad_token": "<|pad|>", "sep_token": "<|sep|>", "cls_token": "<|cls|>", "mask_token": "<|mask|>", "add_prefix_space": false, "do_lower_case": false, "tokenizer_class": "PreTrainedTokenizerFast"}

vocab.json
ADDED
The diff for this file is too large to render. See raw diff.