---
library_name: transformers
tags: []
---

# Model Card for Model ID

English & Greek tokenizer trained from scratch.

### Direct Use

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gsar78/tokenizer_BPE_en_el")

# Tokenize input text
input_text = "This is a game"
inputs = tokenizer(input_text, return_tensors="pt")

# Print the tokenized input (IDs and tokens)
print("Token IDs:", inputs["input_ids"].tolist())

# Convert token IDs to tokens
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])
print("Tokens:", tokens)

# Manually join tokens to form the tokenized string
tokenized_string = ' '.join(tokens)
print("Tokenized String:", tokenized_string)
```

```text
# Output:
Token IDs: [[2967, 317, 220, 1325]]
Tokens: ['This', 'Ġis', 'Ġa', 'Ġgame']
Tokenized String: This Ġis Ġa Ġgame
```

### Recommendations

When tokenizing Greek text, the Greek tokens may appear as gibberish; this does not affect downstream model pretraining. (An improved version of this tokenizer, without the gibberish Greek tokens, can be found at gsar78/Greek_Tokenizer.)

This tokenizer can serve as a good starting point for pretraining a GPT-based model, or any other model that uses BPE.
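The "gibberish" appearance of Greek tokens is expected if this tokenizer uses GPT-2-style byte-level BPE (an assumption suggested by the `Ġ` tokens in the output above). Greek characters are multi-byte in UTF-8, and byte-level BPE maps each raw byte to a printable Unicode character, so Greek tokens render as Latin-looking mojibake even though the underlying bytes round-trip correctly. The sketch below reproduces the GPT-2 byte-to-unicode table to illustrate this:

```python
# Minimal sketch of the GPT-2-style byte-to-unicode mapping (assumption:
# this tokenizer is byte-level BPE, as suggested by the "Ġ" tokens above).
def bytes_to_unicode():
    # Printable bytes keep their own character...
    bs = (list(range(ord("!"), ord("~") + 1))
          + list(range(ord("¡"), ord("¬") + 1))
          + list(range(ord("®"), ord("ÿ") + 1)))
    cs = bs[:]
    # ...and the remaining bytes are shifted into printable code points.
    n = 0
    for b in range(256):
        if b not in bs:
            bs.append(b)
            cs.append(256 + n)
            n += 1
    return dict(zip(bs, [chr(c) for c in cs]))

mapping = bytes_to_unicode()

# The space byte (0x20) maps to "Ġ" — hence tokens like 'Ġis' above.
print(mapping[0x20])  # Ġ

# Greek text is multi-byte UTF-8, so each byte maps to a Latin-looking
# character and the token string looks like gibberish, while the bytes
# themselves are preserved exactly.
word = "παιχνίδι"  # "game" in Greek (hypothetical example input)
print("".join(mapping[b] for b in word.encode("utf-8")))
```

Because the mapping is a bijection over all 256 byte values, decoding inverts it exactly, which is why the odd-looking token strings are harmless for pretraining.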