# Neural Machine Translation for English-Hindi

This project implements a Neural Machine Translation (NMT) system for English-to-Hindi translation. It uses a MarianMT model fine-tuned on a 100k split of the Samanantar dataset and provides a user-friendly Gradio interface.

![NMT UI Screenshot](assets/nmt_ui_screenshot.png)

## Features

- Unidirectional translation from English to Hindi
- User-friendly web interface built with Gradio
- Example translations included
- Built on Helsinki-NLP's MarianMT model

## Installation

### Local Setup with Virtual Environment

1. Clone the repository:
```bash
git clone https://github.com/yourusername/NLPA_Assignment_2_Group_54.git
cd NLPA_Assignment_2_Group_54
```

2. Create and activate a virtual environment:
```bash
python -m venv venv
source venv/bin/activate  # On Windows, use: venv\Scripts\activate
```

3. Install the required packages:
```bash
pip install -r requirements.txt
```

## Usage

1. Make sure your virtual environment is activated
2. Run the UI:
```bash
python nmt_ui.py
```
3. Open your browser and navigate to `http://localhost:7860`

## Supported Language Pairs

- English -> Hindi (using the `rooftopcoder/opus-mt-en-hi-samanantar-100k` model)

## Training the Model

The `train.py` script fine-tunes the MarianMT model on the Samanantar dataset. It performs the following steps:
- Loads the Samanantar dataset (English-Hindi subset).
- Splits the dataset into training and validation sets.
- Tokenizes the dataset.
- Sets up training arguments optimized for GPU.
- Trains the model using the Hugging Face `Trainer` class.
- Saves the trained model to the specified directory.
- Uploads the trained model to the Hugging Face Hub.

To train the model, run:
```bash
python train.py
```
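
The splitting and tokenization steps listed above could look roughly like the following minimal sketch. It assumes the dataset has already been loaded as a Hugging Face `datasets.Dataset` whose rows hold the English sentence in a `src` column and the Hindi sentence in a `tgt` column. Those column names, the 128-token limit, the 10% validation share, and the helper names `preprocess` and `split_and_tokenize` are illustrative assumptions and are not taken from `train.py`.

```python
from functools import partial

MAX_LENGTH = 128  # assumed token limit per sentence; not stated in this README


def preprocess(batch, tokenizer, max_length=MAX_LENGTH):
    """Tokenize a batch of English/Hindi sentence pairs for seq2seq training.

    Assumes `batch` maps "src" to English sentences and "tgt" to Hindi ones.
    """
    # `text_target` makes the tokenizer encode the Hindi side as `labels`.
    return tokenizer(
        batch["src"],
        text_target=batch["tgt"],
        max_length=max_length,
        truncation=True,
    )


def split_and_tokenize(dataset, tokenizer, val_fraction=0.1, seed=42):
    """Split a `datasets.Dataset` into train/validation parts and tokenize both."""
    # train_test_split returns a DatasetDict with "train" and "test" keys;
    # the "test" part serves as the validation set here.
    splits = dataset.train_test_split(test_size=val_fraction, seed=seed)
    return splits.map(
        partial(preprocess, tokenizer=tokenizer),
        batched=True,
        remove_columns=dataset.column_names,
    )
```

The tokenized `train` and `test` splits returned by `split_and_tokenize` are in the form that the Hugging Face `Trainer` expects as its training and evaluation datasets.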

## Testing the Model

The `model_test.py` script runs a quick check of the trained MarianMT model. It performs the following steps:
- Loads the trained model and tokenizer from the Hugging Face Hub.
- Translates a sample input text from English to Hindi.
- Prints the translated text.

To test the model, run:
```bash
python model_test.py
```
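
For reference, the load-and-translate steps above might be written as in the sketch below. The model ID is the one named in this README; the helper names `load_translator` and `translate` and the 128-token limit are assumptions made for illustration, and the actual `model_test.py` may differ. Running it requires `transformers`, `torch`, and `sentencepiece`.

```python
MODEL_ID = "rooftopcoder/opus-mt-en-hi-samanantar-100k"  # model named in this README


def load_translator(model_id=MODEL_ID):
    """Download (or load from cache) the tokenizer and model from the Hub."""
    # Imported here so `translate` can be exercised without transformers installed.
    from transformers import MarianMTModel, MarianTokenizer

    tokenizer = MarianTokenizer.from_pretrained(model_id)
    model = MarianMTModel.from_pretrained(model_id)
    return tokenizer, model


def translate(texts, tokenizer, model, max_length=128):
    """Translate a list of English sentences to Hindi (max_length is an assumed cap)."""
    # Pad to the longest sentence in the batch and return PyTorch tensors.
    batch = tokenizer(
        texts,
        return_tensors="pt",
        padding=True,
        truncation=True,
        max_length=max_length,
    )
    generated = model.generate(**batch, max_length=max_length)
    # Drop padding and end-of-sequence tokens when converting back to text.
    return tokenizer.batch_decode(generated, skip_special_tokens=True)
```

Calling `tokenizer, model = load_translator()` and then `translate(["How are you?"], tokenizer, model)` would download the model on first use and return a list containing one Hindi string.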

## User Interface

The `nmt_ui.py` script provides a Gradio-based user interface for translating text from English to Hindi. The interface also offers transliteration of Romanized Hindi text into Devanagari script.

To launch the interface, run:
```bash
python nmt_ui.py
```
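
As a rough illustration of how such an interface can be wired up, the sketch below wraps an arbitrary translation function in a single-textbox Gradio app. The names `make_handler` and `build_demo`, the labels, and the example sentence are hypothetical; the real `nmt_ui.py` also handles transliteration, which is omitted here.

```python
def make_handler(translate_fn):
    """Wrap a translation function so blank input returns an empty string."""

    def handler(text):
        text = (text or "").strip()
        return translate_fn(text) if text else ""

    return handler


def build_demo(translate_fn):
    """Build a one-textbox Gradio interface around `translate_fn`."""
    import gradio as gr  # imported lazily so the handler works without Gradio

    return gr.Interface(
        fn=make_handler(translate_fn),
        inputs=gr.Textbox(lines=3, label="English"),
        outputs=gr.Textbox(label="Hindi"),
        examples=[["How are you?"]],
        title="English-Hindi NMT",
    )
```

With a function `my_translate` that maps one English string to one Hindi string, `build_demo(my_translate).launch(server_port=7860)` would serve the app at `http://localhost:7860`, the address given in the Usage section.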

## Model Information

This project uses the MarianMT model from Hugging Face Transformers.

### Notes:
- The model translates in one direction only: English -> Hindi.
- It is based on the `Helsinki-NLP/opus-mt-en-hi` model.
- The interface includes transliteration support for Romanized Hindi text.

### Supported Features:
- English -> Hindi translation.
- Romanized Hindi -> Devanagari Hindi transliteration.

### Examples of Transliteration:
- "namaste" → "नमस्ते"
- "aap kaise ho" → "आप कैसे हो"
- "mera naam" → "मेरा नाम"

## Project Structure

```
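NLPA_Assignment_2_Group_54/
├── assets/
│   └── nmt_ui_screenshot.png  # Screenshot of the Gradio interface
├── nmt_ui.py          # Main application file with Gradio interface
├── train.py           # Training script for the MarianMT model
├── model_test.py      # Test script for the trained model
├── requirements.txt   # Python dependencies
└── README.md          # Project documentation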
```

## License

MIT

## Group Members

- Shubhra J Gadhwala: 2023aa05750
- Sandeep Kumar Yadav: 2023ab05047
- Ravi Krishna Mayura: 2023ab05157
- Satheesh Kumar G: 2023ab05041