---
title: GermanToEnglish
emoji: 🔥
colorFrom: gray
colorTo: yellow
sdk: gradio
sdk_version: 4.25.0
app_file: app.py
pinned: false
license: mit
---
# Model Name
German to English Translator
# Model Description
This model translates German text into English. It uses a sequence-to-sequence Transformer (Seq2SeqTransformer) for training.
- **Developed by:** Neelima Monjusha Preeti
- **Model type:** Seq2SeqTransformer
- **Language(s):** Python
- **License:** MIT
- **Contact:** [email protected]
# Task Description
This app translates German to English. The input text is first tokenized, then passed through the encoder and decoder of a Seq2SeqTransformer trained for this task, and the English translation is produced as output.
# Data Processing
First the source and target languages are defined, then tokenization is set up. Tokenizers for German and English are initialized using spaCy; the `get_tokenizer` function from torchtext is used to obtain a spaCy tokenizer for each language.
A function `yield_tokens` is defined to tokenize sentences from the data iterator for both source and target languages.

Special symbols and indices:
Special indices are defined for unknown words (UNK_IDX), padding (PAD_IDX), beginning of sequence (BOS_IDX), and end of sequence (EOS_IDX). The corresponding special symbols are `['<unk>', '<pad>', '<bos>', '<eos>']`.

Then the vocabulary is built. For each language (source and target), the code iterates over the training data and builds a vocabulary using the `build_vocab_from_iterator` function, reusing the tokenization function defined earlier. The vocabulary is built with a minimum frequency of 1 (so every token is included), with the special symbols added first. For each language's vocabulary, the default index for unknown tokens is set to UNK_IDX.
```python
from typing import Iterable, List

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import Multi30k

# Language codes for German (source) and English (target)
SRC_LANGUAGE = 'de'
TGT_LANGUAGE = 'en'

token_transform = {}
vocab_transform = {}

# spaCy tokenizers for the source and target languages
token_transform[SRC_LANGUAGE] = get_tokenizer('spacy', language='de_core_news_sm')
token_transform[TGT_LANGUAGE] = get_tokenizer('spacy', language='en_core_web_sm')

def yield_tokens(data_iter: Iterable, language: str) -> List[str]:
    language_index = {SRC_LANGUAGE: 0, TGT_LANGUAGE: 1}
    for data_sample in data_iter:
        yield token_transform[language](data_sample[language_index[language]])

# Define special symbols and indices
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
# Make sure the tokens are in order of their indices to properly insert them in vocab
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    # Training data iterator
    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    vocab_transform[ln] = build_vocab_from_iterator(yield_tokens(train_iter, ln),
                                                    min_freq=1,
                                                    specials=special_symbols,
                                                    special_first=True)

for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    # Map out-of-vocabulary tokens to UNK_IDX
    vocab_transform[ln].set_default_index(UNK_IDX)
```
# Model Architecture
For machine translation a Seq2SeqTransformer is used.
The class `PositionalEncoding(nn.Module)` adds positional encodings to the token embeddings, while `TokenEmbedding(nn.Module)` converts token indices into dense embeddings using an embedding layer. A sketch of both classes is shown below.
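A minimal sketch of these two helper modules, assuming they follow the standard PyTorch sequence-to-sequence translation tutorial (the exact implementation in germantoenglish.py may differ slightly):
```python
import math
import torch
import torch.nn as nn
from torch import Tensor

class PositionalEncoding(nn.Module):
    def __init__(self, emb_size: int, dropout: float, maxlen: int = 5000):
        super().__init__()
        # Precompute sinusoidal positional encodings for up to maxlen positions
        den = torch.exp(-torch.arange(0, emb_size, 2) * math.log(10000) / emb_size)
        pos = torch.arange(0, maxlen).reshape(maxlen, 1)
        pos_embedding = torch.zeros((maxlen, emb_size))
        pos_embedding[:, 0::2] = torch.sin(pos * den)
        pos_embedding[:, 1::2] = torch.cos(pos * den)
        pos_embedding = pos_embedding.unsqueeze(-2)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer('pos_embedding', pos_embedding)

    def forward(self, token_embedding: Tensor):
        # Add the positional encoding for each position in the sequence
        return self.dropout(token_embedding + self.pos_embedding[:token_embedding.size(0), :])

class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, emb_size: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: Tensor):
        # Scale embeddings by sqrt(emb_size), as in "Attention Is All You Need"
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)
```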
The parameters defined and initialized for the model are listed below (a hedged instantiation sketch follows the list):
- **num_encoder_layers:** number of layers in the encoder stack (3).
- **num_decoder_layers:** number of layers in the decoder stack (3).
- **emb_size:** dimensionality of the token embeddings (512).
- **nhead:** number of attention heads in the multi-head attention mechanism (8).
- **src_vocab_size:** vocabulary size of the source language.
- **tgt_vocab_size:** vocabulary size of the target language.
- **dim_feedforward:** dimensionality of the feedforward network (default 512).
- **dropout:** dropout probability (default 0.1).
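As a rough sketch, these hyperparameters might be wired up as follows. The constant names mirror those used by the `translate()` function later in this README; `DEVICE` and the value `NHEAD = 8` are assumptions rather than values quoted from the source, and `Seq2SeqTransformer` is the class defined in germantoenglish.py:
```python
import torch

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Hyperparameters as described above (NHEAD assumed to be 8)
NUM_ENCODER_LAYERS = 3
NUM_DECODER_LAYERS = 3
EMB_SIZE = 512
NHEAD = 8
FFN_HID_DIM = 512
SRC_VOCAB_SIZE = len(vocab_transform[SRC_LANGUAGE])
TGT_VOCAB_SIZE = len(vocab_transform[TGT_LANGUAGE])

transformer = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE,
                                 NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)
transformer = transformer.to(DEVICE)
```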
The loss function and optimizer are defined as follows:
```python
# Ignore padding positions when computing the loss
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)
optimizer = torch.optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
```
The data is then passed through the encoder and decoder layers of the model.
The helper functions and transform dictionary are:
```python
sequential_transforms(*transforms)
tensor_transform(token_ids: List[int])
collate_fn(batch)
text_transform = {}
```
These utility functions and transformations handle the preprocessing of the text data: tokenization, numericalization, adding special tokens, and collating samples into batch tensors suitable for training a sequence-to-sequence transformer model. A sketch of these helpers is given below.
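A minimal sketch of these helpers, assuming they follow the standard torchtext preprocessing pattern (the exact bodies in germantoenglish.py may differ):
```python
from typing import List

import torch
from torch.nn.utils.rnn import pad_sequence

def sequential_transforms(*transforms):
    # Compose several transforms into one callable applied left to right
    def func(txt_input):
        for transform in transforms:
            txt_input = transform(txt_input)
        return txt_input
    return func

def tensor_transform(token_ids: List[int]):
    # Wrap the token ids with BOS/EOS indices and return a tensor
    return torch.cat((torch.tensor([BOS_IDX]),
                      torch.tensor(token_ids),
                      torch.tensor([EOS_IDX])))

# text_transform: raw string -> tokens -> vocabulary indices -> tensor with BOS/EOS
text_transform = {}
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    text_transform[ln] = sequential_transforms(token_transform[ln],
                                               vocab_transform[ln],
                                               tensor_transform)

def collate_fn(batch):
    # Convert a batch of (src, tgt) string pairs into padded batch tensors
    src_batch, tgt_batch = [], []
    for src_sample, tgt_sample in batch:
        src_batch.append(text_transform[SRC_LANGUAGE](src_sample.rstrip("\n")))
        tgt_batch.append(text_transform[TGT_LANGUAGE](tgt_sample.rstrip("\n")))
    src_batch = pad_sequence(src_batch, padding_value=PAD_IDX)
    tgt_batch = pad_sequence(tgt_batch, padding_value=PAD_IDX)
    return src_batch, tgt_batch
```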
The Seq2SeqTransformer model is then trained and evaluated with the function `evaluate(model)`.
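A hedged sketch of one training epoch, assuming a tutorial-style loop; `create_mask` and `BATCH_SIZE` are assumed names, and `loss_fn`, `DEVICE`, and `collate_fn` come from the snippets above. `evaluate(model)` follows the same pattern over the validation split, without the backward pass:
```python
from torch.utils.data import DataLoader

BATCH_SIZE = 128  # assumed batch size

def train_epoch(model, optimizer):
    model.train()
    losses, num_batches = 0.0, 0
    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    train_dataloader = DataLoader(train_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)
    for src, tgt in train_dataloader:
        src, tgt = src.to(DEVICE), tgt.to(DEVICE)
        tgt_input = tgt[:-1, :]   # decoder input is the target shifted right
        # create_mask is an assumed helper building attention and padding masks
        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)
        logits = model(src, tgt_input, src_mask, tgt_mask,
                       src_padding_mask, tgt_padding_mask, src_padding_mask)
        optimizer.zero_grad()
        tgt_out = tgt[1:, :]      # targets are the same sequence shifted left
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        loss.backward()
        optimizer.step()
        losses += loss.item()
        num_batches += 1
    return losses / num_batches
```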
# Result Analysis
The function `greedy_decode()` takes the following parameters:
- **model:** the sequence-to-sequence transformer model.
- **src:** the source sequence tensor.
- **src_mask:** the mask for the source sequence.
- **max_len:** the maximum length of the output sequence.
- **start_symbol:** the index of the start symbol in the target vocabulary.

It returns the generated target sequence tensor `ys`, which contains the complete translation.
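A minimal sketch of how such a greedy decoder is typically written for this kind of model, assuming it exposes `encode`, `decode`, and `generator` components and a `generate_square_subsequent_mask` helper as in the standard PyTorch translation tutorial (the actual implementation may differ):
```python
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    src, src_mask = src.to(DEVICE), src_mask.to(DEVICE)
    memory = model.encode(src, src_mask)
    # Start the target sequence with the start symbol (BOS)
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE)
    for _ in range(max_len - 1):
        memory = memory.to(DEVICE)
        # generate_square_subsequent_mask is an assumed causal-mask helper
        tgt_mask = (generate_square_subsequent_mask(ys.size(0))
                    .type(torch.bool)).to(DEVICE)
        out = model.decode(ys, memory, tgt_mask)
        out = out.transpose(0, 1)
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.item()
        # Append the most likely next token; stop once EOS is produced
        ys = torch.cat([ys,
                        torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)
        if next_word == EOS_IDX:
            break
    return ys
```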
## Test input
The function for translating German to English is `translate()`.
```python
def translate(src_sentence: str):
    # Rebuild the model and load the trained weights
    model = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE,
                               NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)
    model.load_state_dict(torch.load('./transformer_model.pth'))
    model.to(DEVICE)
    model.eval()
    # Tokenize and numericalize the source sentence
    src = text_transform[SRC_LANGUAGE](src_sentence).view(-1, 1)
    num_tokens = src.shape[0]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    # Greedily decode the translation and strip the special tokens
    tgt_tokens = greedy_decode(
        model, src, src_mask, max_len=num_tokens + 5, start_symbol=BOS_IDX).flatten()
    return " ".join(vocab_transform[TGT_LANGUAGE].lookup_tokens(list(tgt_tokens.cpu().numpy()))).replace("<bos>", "").replace("<eos>", "")
```
This function first loads the saved model, then tokenizes the input sentence and applies `greedy_decode` to obtain the translated tokens, which are joined into the output string.
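For example, a call might look like this (the sentence and output are illustrative only):
```python
print(translate("Eine Gruppe von Menschen steht vor einem Iglu ."))
# e.g. " A group of people stand in front of an igloo "
```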
# Hugging Face Interface
To create the interface, gradio and torch are imported, along with the Seq2SeqTransformer class and the translate and greedy_decode functions from germantoenglish.py.
```python
import gradio as gr
import torch

from germantoenglish import Seq2SeqTransformer, translate, greedy_decode
```
The app takes a German line as input and shows the translated English text as output.
```python
if __name__ == "__main__":
    iface = gr.Interface(
        fn=translate,
        inputs=[
            gr.components.Textbox(label="Text")
        ],
        outputs=["text"],
        cache_examples=False,
        title="GermanToEnglish",
    )
    iface.launch(share=True)
```
The app interface looks like this:

# Project Structure
```bash
|---Readme.md
|
|---germantoenglish.py   - full code for data processing, training, and evaluation
|
|---app.py               - creates the app interface
|
|---Modeltensors         - tensor file needed for loading the app
|
|---requirements.txt     - packages and dataset dependencies that must be installed for the app to work
|
|---translate_model.pth  - the model file loaded by the app
```
# How to Run
```bash
git clone https://huggingface.co/spaces/neelimapreeti297/GermanToEnglish
cd GermanToEnglish
pip install -r requirements.txt
python app.py
```
# License
This project is licensed under the MIT License.
# Contributor
Neelima Monjusha Preeti - [email protected]

App link: https://huggingface.co/spaces/neelimapreeti297/GermanToEnglish