# Neural Network-Based Language Model for Next Token Prediction
## Overview
This project is a midterm assignment focused on developing a neural network-based language model for next token prediction. The model was trained on a custom dataset covering two languages, English and Amharic. The project applies neural network techniques to predict the next token in a sequence, demonstrating a non-transformer approach to language modeling.
## Project Objectives
The main objectives of this project were to:
- Develop a neural network-based model for next token prediction without using transformers or encoder-decoder architectures.
- Experiment with multiple languages to observe model performance.
- Implement checkpointing to save model progress and generate text during different training stages.
- Present a video demo showcasing the model's performance in generating text in both English and Amharic.
## Project Details
### 1. Training Languages
The model was trained on datasets in English and Amharic. The datasets were cleaned and prepared for training, including tokenization and embedding.
### 2. Tokenizer
A custom tokenizer was created using Byte Pair Encoding (BPE). This tokenizer was trained on five languages: English, Amharic, Sanskrit, Nepali, and Hindi, although only English and Amharic were used for this task.
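
The tokenizer training script is not included in this README; as an illustration, a BPE tokenizer of this kind can be trained with the Hugging Face `tokenizers` library (the file paths, vocabulary size, and special tokens below are hypothetical, not the project's actual settings):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Build an empty BPE tokenizer and train it on raw text files.
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

trainer = BpeTrainer(
    vocab_size=30000,                   # hypothetical vocabulary size
    special_tokens=["[UNK]", "[PAD]"],  # hypothetical special tokens
)

# One corpus file per training language (hypothetical file names).
files = ["english.txt", "amharic.txt", "sanskrit.txt", "nepali.txt", "hindi.txt"]
tokenizer.train(files, trainer)
tokenizer.save("bpe_tokenizer.json")
```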
### 3. Embedding Model
A custom embedding model was employed to convert tokens into vector representations, allowing the neural network to better understand the structure and meaning of the input data.
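
In PyTorch, such a token-to-vector lookup is typically an `nn.Embedding` layer; a minimal sketch with hypothetical dimensions (the project's actual sizes are not stated in this README):

```python
import torch
import torch.nn as nn

# Hypothetical sizes; see the notebook for the project's actual values.
vocab_size, embedding_dim = 30000, 256
embedding = nn.Embedding(vocab_size, embedding_dim)

token_ids = torch.tensor([[12, 845, 3]])  # a batch with one 3-token sequence
vectors = embedding(token_ids)            # shape: (1, 3, 256)
```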
### 4. Model Architecture
The project uses an LSTM (Long Short-Term Memory) neural network to predict the next token in a sequence. LSTMs are well-suited for sequential data and are a popular choice for language modeling due to their ability to capture long-term dependencies.
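
A minimal PyTorch sketch of an LSTM language model in this style (layer sizes are hypothetical; the project's actual architecture is defined in the notebook):

```python
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim=256, hidden_dim=512, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim,
                            num_layers=num_layers, batch_first=True)
        self.fc = nn.Linear(hidden_dim, vocab_size)  # scores over the vocabulary

    def forward(self, token_ids, hidden=None):
        embedded = self.embedding(token_ids)          # (batch, seq, embed)
        output, hidden = self.lstm(embedded, hidden)  # (batch, seq, hidden)
        logits = self.fc(output)                      # next-token logits per position
        return logits, hidden
```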
## Results and Evaluation
### Training Curve and Loss
The model's training and validation losses over time are documented and included in the repository (`loss_values.csv`). The training curve demonstrates the model's learning progress, with explanations provided for key observations in the loss trends.
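
To inspect the curves yourself, the CSV can be plotted directly; the column names used here are assumptions, so check the header of `loss_values.csv` first:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Assumed columns: epoch, train_loss, val_loss.
losses = pd.read_csv("loss_values.csv")
losses.plot(x="epoch", y=["train_loss", "val_loss"])
plt.xlabel("Epoch")
plt.ylabel("Cross-entropy loss")
plt.show()
```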
### Checkpoint Implementation
Checkpointing was implemented to save model states at different training stages, allowing for partial model evaluations and text generation demos. Checkpoints are included in the repository for reference.
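
The typical PyTorch pattern for this looks roughly as follows; the exact dictionary keys and file naming used in the project may differ:

```python
import torch

def save_checkpoint(model, optimizer, epoch, path):
    # Persist everything needed to resume training or generate text later.
    torch.save({
        "epoch": epoch,
        "model_state_dict": model.state_dict(),
        "optimizer_state_dict": optimizer.state_dict(),
    }, path)

def load_checkpoint(model, optimizer, path):
    # Restore a saved training state in place and return its epoch.
    checkpoint = torch.load(path)
    model.load_state_dict(checkpoint["model_state_dict"])
    optimizer.load_state_dict(checkpoint["optimizer_state_dict"])
    return checkpoint["epoch"]
```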
### Perplexity Score
The model's perplexity scores, calculated during training, are available in the `perplexity.csv` file. Perplexity measures how well the model predicts the next token over the course of training; lower values indicate better performance.
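
For reference, perplexity is the exponential of the mean cross-entropy loss (in nats), so it can be recomputed from the logged loss values:

```python
import math

def perplexity(mean_cross_entropy: float) -> float:
    # exp of the mean per-token cross-entropy; lower is better.
    return math.exp(mean_cross_entropy)

print(perplexity(4.2))  # a loss of 4.2 nats gives a perplexity of about 66.7
```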
## Demonstration
A video demo, linked below, demonstrates:
- Text generation in English from a randomly initialized (untrained) model.
- Text generation using the trained model in both English and Amharic, with English translations provided using Google Translate (a minimal generation loop is sketched below).
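
The generation code itself lives in the notebook; a minimal sampling loop in this style, assuming the model and tokenizer interfaces sketched above, might look like:

```python
import torch

@torch.no_grad()
def generate(model, tokenizer, prompt, max_new_tokens=50):
    # Encode the prompt, then repeatedly sample one token at a time,
    # carrying the LSTM hidden state forward between steps.
    ids = tokenizer.encode(prompt).ids
    input_ids = torch.tensor([ids])
    hidden = None
    for _ in range(max_new_tokens):
        logits, hidden = model(input_ids, hidden)
        probs = torch.softmax(logits[0, -1], dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)
        ids.append(next_id.item())
        input_ids = next_id.view(1, 1)  # feed only the new token; state is in `hidden`
    return tokenizer.decode(ids)
```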
**Video Demo Link:** [YouTube Demo](https://youtu.be/1m21NYmLSC4)
## Instructions for Reproducing the Results
1. Install dependencies (Python, PyTorch, and other required libraries).
2. Load the `.ipynb` notebook and run the cells sequentially to replicate training and evaluation.
3. Refer to the HuggingFace documentation for downloading the model and tokenizer files; a minimal download example is sketched below.
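
For instance, individual files can be fetched with `huggingface_hub`; the repo id and filenames below are placeholders for this repository's actual values:

```python
from huggingface_hub import hf_hub_download

# Placeholder repo id and filenames; substitute the values from this model page.
model_path = hf_hub_download(repo_id="user/model-repo", filename="model.pt")
tokenizer_path = hf_hub_download(repo_id="user/model-repo", filename="bpe_tokenizer.json")
```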