---
title: GermanToEnglish
emoji: 🔥
colorFrom: gray
colorTo: yellow
sdk: gradio
sdk_version: 4.25.0
app_file: app.py
pinned: false
license: mit
---
# Model Name
German to English Translator
# Model Description
This model translates German text into English. It uses a sequence-to-sequence Transformer (Seq2SeqTransformer) for training.
- **Developed by:** Neelima Monjusha Preeti
- **Model type:** Seq2SeqTransformer
- **Language(s):** Python
- **License:** MIT
- **Contact:** [email protected]
# Task Description
This app translates German to English. The input text is first tokenized, then passed through the encoder and decoder of a Seq2SeqTransformer trained for this task, and the English translation is produced as output.
# Data Processing
First the source and target languages are defined, then tokenization is set up. Tokenizers for German and English are initialized using spaCy; the `get_tokenizer` function from torchtext is used to obtain a spaCy tokenizer for each language.
A function `yield_tokens` is defined to tokenize sentences from the data iterator for both source and target languages.

Special symbols and indices:
Special indices are defined for unknown words (UNK_IDX), padding (PAD_IDX), beginning of sequence (BOS_IDX), and end of sequence (EOS_IDX). The corresponding special symbols are `['<unk>', '<pad>', '<bos>', '<eos>']`.

Then the vocabulary is built. For each language (source and target), the code iterates over the training data and builds a vocabulary using the `build_vocab_from_iterator` function, reusing the tokenization function defined earlier. The vocabulary is built with a minimum frequency of 1 (so every token is included), with the special symbols added first. For each language's vocabulary, the default index for unknown tokens is set to UNK_IDX.
```python
from typing import Iterable, List

from torchtext.data.utils import get_tokenizer
from torchtext.vocab import build_vocab_from_iterator
from torchtext.datasets import Multi30k

# Language codes for German (source) and English (target)
SRC_LANGUAGE = 'de'
TGT_LANGUAGE = 'en'

token_transform = {}
vocab_transform = {}

# spaCy tokenizers for the source and target languages
token_transform[SRC_LANGUAGE] = get_tokenizer('spacy', language='de_core_news_sm')
token_transform[TGT_LANGUAGE] = get_tokenizer('spacy', language='en_core_web_sm')

def yield_tokens(data_iter: Iterable, language: str) -> List[str]:
    language_index = {SRC_LANGUAGE: 0, TGT_LANGUAGE: 1}
    for data_sample in data_iter:
        yield token_transform[language](data_sample[language_index[language]])

# Define special symbols and indices
UNK_IDX, PAD_IDX, BOS_IDX, EOS_IDX = 0, 1, 2, 3
# Make sure the tokens are in order of their indices to properly insert them in vocab
special_symbols = ['<unk>', '<pad>', '<bos>', '<eos>']

for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    # Training data iterator
    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    vocab_transform[ln] = build_vocab_from_iterator(yield_tokens(train_iter, ln),
                                                    min_freq=1,
                                                    specials=special_symbols,
                                                    special_first=True)

for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    # Map out-of-vocabulary tokens to UNK_IDX
    vocab_transform[ln].set_default_index(UNK_IDX)
```
# Model Architecture
For machine translation a Seq2SeqTransformer is used.
The class `PositionalEncoding(nn.Module)` adds positional encodings to the token embeddings, while `TokenEmbedding(nn.Module)` converts token indices into dense embeddings using an embedding layer. A sketch of both classes is shown below.
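A minimal sketch of these two helper modules, assuming they follow the standard PyTorch sequence-to-sequence translation tutorial (the exact implementation in germantoenglish.py may differ slightly):
```python
import math
import torch
import torch.nn as nn
from torch import Tensor

class PositionalEncoding(nn.Module):
    def __init__(self, emb_size: int, dropout: float, maxlen: int = 5000):
        super().__init__()
        # Precompute sinusoidal positional encodings for up to maxlen positions
        den = torch.exp(-torch.arange(0, emb_size, 2) * math.log(10000) / emb_size)
        pos = torch.arange(0, maxlen).reshape(maxlen, 1)
        pos_embedding = torch.zeros((maxlen, emb_size))
        pos_embedding[:, 0::2] = torch.sin(pos * den)
        pos_embedding[:, 1::2] = torch.cos(pos * den)
        pos_embedding = pos_embedding.unsqueeze(-2)
        self.dropout = nn.Dropout(dropout)
        self.register_buffer('pos_embedding', pos_embedding)

    def forward(self, token_embedding: Tensor):
        # Add the positional encoding for each position in the sequence
        return self.dropout(token_embedding + self.pos_embedding[:token_embedding.size(0), :])

class TokenEmbedding(nn.Module):
    def __init__(self, vocab_size: int, emb_size: int):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)
        self.emb_size = emb_size

    def forward(self, tokens: Tensor):
        # Scale embeddings by sqrt(emb_size), as in "Attention Is All You Need"
        return self.embedding(tokens.long()) * math.sqrt(self.emb_size)
```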
The parameters defined and initialized for the model are listed below (a hedged instantiation sketch follows the list):
- **num_encoder_layers:** number of layers in the encoder stack (3).
- **num_decoder_layers:** number of layers in the decoder stack (3).
- **emb_size:** dimensionality of the token embeddings (512).
- **nhead:** number of attention heads in the multi-head attention mechanism (8).
- **src_vocab_size:** vocabulary size of the source language.
- **tgt_vocab_size:** vocabulary size of the target language.
- **dim_feedforward:** dimensionality of the feedforward network (default 512).
- **dropout:** dropout probability (default 0.1).
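As a rough sketch, these hyperparameters might be wired up as follows. The constant names mirror those used by the `translate()` function later in this README; `DEVICE` and the value `NHEAD = 8` are assumptions rather than values quoted from the source, and `Seq2SeqTransformer` is the class defined in germantoenglish.py:
```python
import torch

DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Hyperparameters as described above (NHEAD assumed to be 8)
NUM_ENCODER_LAYERS = 3
NUM_DECODER_LAYERS = 3
EMB_SIZE = 512
NHEAD = 8
FFN_HID_DIM = 512
SRC_VOCAB_SIZE = len(vocab_transform[SRC_LANGUAGE])
TGT_VOCAB_SIZE = len(vocab_transform[TGT_LANGUAGE])

transformer = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE,
                                 NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)
transformer = transformer.to(DEVICE)
```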
The loss function and optimizer are defined as follows:
```python
# Ignore padding positions when computing the loss
loss_fn = torch.nn.CrossEntropyLoss(ignore_index=PAD_IDX)
optimizer = torch.optim.Adam(transformer.parameters(), lr=0.0001, betas=(0.9, 0.98), eps=1e-9)
```
The data is then passed through the encoder and decoder layers of the model.
The helper functions and transform dictionary are:
```python
sequential_transforms(*transforms)
tensor_transform(token_ids: List[int])
collate_fn(batch)
text_transform = {}
```
These utility functions and transformations handle the preprocessing of the text data: tokenization, numericalization, adding special tokens, and collating samples into batch tensors suitable for training a sequence-to-sequence transformer model. A sketch of these helpers is given below.
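A minimal sketch of these helpers, assuming they follow the standard torchtext preprocessing pattern (the exact bodies in germantoenglish.py may differ):
```python
from typing import List

import torch
from torch.nn.utils.rnn import pad_sequence

def sequential_transforms(*transforms):
    # Compose several transforms into one callable applied left to right
    def func(txt_input):
        for transform in transforms:
            txt_input = transform(txt_input)
        return txt_input
    return func

def tensor_transform(token_ids: List[int]):
    # Wrap the token ids with BOS/EOS indices and return a tensor
    return torch.cat((torch.tensor([BOS_IDX]),
                      torch.tensor(token_ids),
                      torch.tensor([EOS_IDX])))

# text_transform: raw string -> tokens -> vocabulary indices -> tensor with BOS/EOS
text_transform = {}
for ln in [SRC_LANGUAGE, TGT_LANGUAGE]:
    text_transform[ln] = sequential_transforms(token_transform[ln],
                                               vocab_transform[ln],
                                               tensor_transform)

def collate_fn(batch):
    # Convert a batch of (src, tgt) string pairs into padded batch tensors
    src_batch, tgt_batch = [], []
    for src_sample, tgt_sample in batch:
        src_batch.append(text_transform[SRC_LANGUAGE](src_sample.rstrip("\n")))
        tgt_batch.append(text_transform[TGT_LANGUAGE](tgt_sample.rstrip("\n")))
    src_batch = pad_sequence(src_batch, padding_value=PAD_IDX)
    tgt_batch = pad_sequence(tgt_batch, padding_value=PAD_IDX)
    return src_batch, tgt_batch
```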
The Seq2SeqTransformer model is then trained and evaluated with the function `evaluate(model)`.
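A hedged sketch of one training epoch, assuming a tutorial-style loop; `create_mask` and `BATCH_SIZE` are assumed names, and `loss_fn`, `DEVICE`, and `collate_fn` come from the snippets above. `evaluate(model)` follows the same pattern over the validation split, without the backward pass:
```python
from torch.utils.data import DataLoader

BATCH_SIZE = 128  # assumed batch size

def train_epoch(model, optimizer):
    model.train()
    losses, num_batches = 0.0, 0
    train_iter = Multi30k(split='train', language_pair=(SRC_LANGUAGE, TGT_LANGUAGE))
    train_dataloader = DataLoader(train_iter, batch_size=BATCH_SIZE, collate_fn=collate_fn)
    for src, tgt in train_dataloader:
        src, tgt = src.to(DEVICE), tgt.to(DEVICE)
        tgt_input = tgt[:-1, :]   # decoder input is the target shifted right
        # create_mask is an assumed helper building attention and padding masks
        src_mask, tgt_mask, src_padding_mask, tgt_padding_mask = create_mask(src, tgt_input)
        logits = model(src, tgt_input, src_mask, tgt_mask,
                       src_padding_mask, tgt_padding_mask, src_padding_mask)
        optimizer.zero_grad()
        tgt_out = tgt[1:, :]      # targets are the same sequence shifted left
        loss = loss_fn(logits.reshape(-1, logits.shape[-1]), tgt_out.reshape(-1))
        loss.backward()
        optimizer.step()
        losses += loss.item()
        num_batches += 1
    return losses / num_batches
```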
# Result Analysis
The function `greedy_decode()` takes the following parameters:
- **model:** the sequence-to-sequence transformer model.
- **src:** the source sequence tensor.
- **src_mask:** the mask for the source sequence.
- **max_len:** the maximum length of the output sequence.
- **start_symbol:** the index of the start symbol in the target vocabulary.

It returns the generated target sequence tensor `ys`, which contains the complete translation.
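A minimal sketch of how such a greedy decoder is typically written for this kind of model, assuming it exposes `encode`, `decode`, and `generator` components and a `generate_square_subsequent_mask` helper as in the standard PyTorch translation tutorial (the actual implementation may differ):
```python
def greedy_decode(model, src, src_mask, max_len, start_symbol):
    src, src_mask = src.to(DEVICE), src_mask.to(DEVICE)
    memory = model.encode(src, src_mask)
    # Start the target sequence with the start symbol (BOS)
    ys = torch.ones(1, 1).fill_(start_symbol).type(torch.long).to(DEVICE)
    for _ in range(max_len - 1):
        memory = memory.to(DEVICE)
        # generate_square_subsequent_mask is an assumed causal-mask helper
        tgt_mask = (generate_square_subsequent_mask(ys.size(0))
                    .type(torch.bool)).to(DEVICE)
        out = model.decode(ys, memory, tgt_mask)
        out = out.transpose(0, 1)
        prob = model.generator(out[:, -1])
        _, next_word = torch.max(prob, dim=1)
        next_word = next_word.item()
        # Append the most likely next token; stop once EOS is produced
        ys = torch.cat([ys,
                        torch.ones(1, 1).type_as(src.data).fill_(next_word)], dim=0)
        if next_word == EOS_IDX:
            break
    return ys
```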
## Test input
The function for translating German to English is `translate()`.
```python
def translate(src_sentence: str):
    # Rebuild the model and load the trained weights
    model = Seq2SeqTransformer(NUM_ENCODER_LAYERS, NUM_DECODER_LAYERS, EMB_SIZE,
                               NHEAD, SRC_VOCAB_SIZE, TGT_VOCAB_SIZE, FFN_HID_DIM)
    model.load_state_dict(torch.load('./transformer_model.pth'))
    model.to(DEVICE)
    model.eval()
    # Tokenize and numericalize the source sentence
    src = text_transform[SRC_LANGUAGE](src_sentence).view(-1, 1)
    num_tokens = src.shape[0]
    src_mask = (torch.zeros(num_tokens, num_tokens)).type(torch.bool)
    # Greedily decode the translation and strip the special tokens
    tgt_tokens = greedy_decode(
        model, src, src_mask, max_len=num_tokens + 5, start_symbol=BOS_IDX).flatten()
    return " ".join(vocab_transform[TGT_LANGUAGE].lookup_tokens(list(tgt_tokens.cpu().numpy()))).replace("<bos>", "").replace("<eos>", "")
```
This function first loads the saved model, then tokenizes the input sentence and applies `greedy_decode` to obtain the translated tokens, which are joined into the output string.
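For example, a call might look like this (the sentence and output are illustrative only):
```python
print(translate("Eine Gruppe von Menschen steht vor einem Iglu ."))
# e.g. " A group of people stand in front of an igloo "
```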
# Hugging Face Interface
To create the interface, gradio and torch are imported, along with the Seq2SeqTransformer class and the translate and greedy_decode functions from germantoenglish.py.
```python
import gradio as gr
import torch

from germantoenglish import Seq2SeqTransformer, translate, greedy_decode
```
The app takes a German line as input and shows the translated English text as output.
```python
if __name__ == "__main__":
    iface = gr.Interface(
        fn=translate,
        inputs=[
            gr.components.Textbox(label="Text")
        ],
        outputs=["text"],
        cache_examples=False,
        title="GermanToEnglish",
    )
    iface.launch(share=True)
```
The app interface looks like this:

# Project Structure
```bash
|---Readme.md
|
|---germantoenglish.py   - full code for data processing, training, and evaluation
|
|---app.py               - creates the app interface
|
|---Modeltensors         - tensor file needed for loading the app
|
|---requirements.txt     - packages and dataset dependencies that must be installed for the app to work
|
|---translate_model.pth  - the model file loaded by the app
```
# How to Run
```bash
git clone https://huggingface.co/spaces/neelimapreeti297/GermanToEnglish
cd GermanToEnglish
pip install -r requirements.txt
python app.py
```
# License
This project is licensed under the MIT License.
# Contributor
Neelima Monjusha Preeti - [email protected]

App link: https://huggingface.co/spaces/neelimapreeti297/GermanToEnglish