malteos committed on
Commit a2fccc6 · 1 Parent(s): fa589b7

Update README.md

  1. README.md +28 -0
README.md CHANGED
@@ -1,3 +1,31 @@
  ---
  license: mit
  ---
+
+ A GPT-2 tokenizer for English and German with a vocabulary size of 88,301.
+
+ This tokenizer was created by merging the [original GPT2](https://huggingface.co/gpt2) tokenizer (English) with a [German tokenizer](https://huggingface.co/malteos/gpt2-xl-wechsel-german): every German token that is missing from the English vocabulary is appended to it.
+
+ ## Steps to reproduce
+
+ ```python
+ from transformers import AutoTokenizer
+
+ a_tokenizer = AutoTokenizer.from_pretrained('gpt2')  # English
+ b_tokenizer = AutoTokenizer.from_pretrained('malteos/gpt2-xl-wechsel-german')  # German
+
+ a_vocab = set(a_tokenizer.get_vocab().keys())  # len(a_vocab) = 50257
+ b_vocab = set(b_tokenizer.get_vocab().keys())  # len(b_vocab) = 50257
+
+ # German tokens that do not occur in the English vocabulary
+ missing_tokens_in_a = b_vocab - a_vocab  # len = 38044
+
+ # Sort so the new token IDs are assigned in a deterministic order
+ a_tokenizer.add_tokens(sorted(missing_tokens_in_a))
+
+ # save_pretrained writes the merged tokenizer files to the given directory
+ a_tokenizer.save_pretrained('opengptx-en-de')  # len(a_tokenizer) = 88301
+ ```
+
+
+
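The merge described in the README can be illustrated without downloading either tokenizer. The sketch below is a toy model only: `merge_vocabs` and the two four-entry vocabularies are hypothetical stand-ins, not part of `transformers`, and it assumes the base vocabulary uses contiguous IDs starting at 0. It also checks the size arithmetic implied by the figures quoted in the code comments.

```python
def merge_vocabs(base_vocab, extra_vocab):
    """Append tokens that base_vocab lacks, giving them IDs after the base IDs.

    Hypothetical helper for illustration; assumes base IDs are 0..len-1.
    """
    merged = dict(base_vocab)
    # Sorting makes the ID assignment independent of set iteration order.
    missing = sorted(set(extra_vocab) - set(base_vocab))
    for offset, token in enumerate(missing):
        merged[token] = len(base_vocab) + offset
    return merged, missing


# Toy stand-ins for the English and German vocabularies (assumed data).
english = {"the": 0, "house": 1, "in": 2, "er": 3}
german = {"das": 0, "haus": 1, "in": 2, "er": 3}

merged, missing = merge_vocabs(english, german)
print(missing)      # ['das', 'haus']
print(len(merged))  # 6

# Size arithmetic using the figures from the README comments.
size_a, size_b, added = 50257, 50257, 38044
merged_size = size_a + added  # 88301, the stated vocabulary size
shared = size_b - added       # 12213 tokens occur in both vocabularies
print(merged_size, shared)
```

Existing tokens keep their original IDs, so only the appended entries are new; this is why the merged size is the English size plus the number of German-only tokens rather than the sum of both vocabularies.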