---
license: mit
---

A GPT-2 tokenizer for English and German with a vocabulary size of 88,301.

This tokenizer was created by merging the [original GPT2](https://huggingface.co/gpt2) tokenizer (English) with a [German tokenizer](https://huggingface.co/malteos/gpt2-xl-wechsel-german).

## Steps to reproduce

```python
from transformers import AutoTokenizer

a_tokenizer = AutoTokenizer.from_pretrained('gpt2')
b_tokenizer = AutoTokenizer.from_pretrained('malteos/gpt2-xl-wechsel-german')

a_vocab = set(a_tokenizer.vocab.keys())  # len(a_vocab) = 50257
b_vocab = set(b_tokenizer.vocab.keys())  # len(b_vocab) = 50257

# German tokens that do not occur in the English vocabulary
missing_tokens_in_a = b_vocab - a_vocab  # len = 38044

# Extend the English tokenizer with the missing German tokens
a_tokenizer.add_tokens(list(missing_tokens_in_a))

a_tokenizer.save_pretrained('opengptx-en-de')  # len(a_tokenizer) = 88301
```
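
To use the merged tokenizer, load it like any other saved tokenizer. A minimal sketch, assuming it was saved to the local `opengptx-en-de` directory as above (the example strings are illustrative):

```python
from transformers import AutoTokenizer

# Load the merged English/German tokenizer from the local directory
tokenizer = AutoTokenizer.from_pretrained('opengptx-en-de')

# One vocabulary now covers both languages
print(tokenizer.tokenize('Hello, world!'))
print(tokenizer.tokenize('Hallo, Welt!'))
```

Note that a pretrained GPT-2 model expects the original 50,257-entry vocabulary; when pairing such a model with this tokenizer, its embedding matrix has to be resized to the new vocabulary size, e.g. `model.resize_token_embeddings(len(tokenizer))`.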