Model Summary

বনলতা (Banalata, meaning "forest vine") is an LLM for generating Bengali poems.

Banalata was created by fine-tuning Gemma 2 9B with Unsloth on the works of a dozen legendary Bengali poets, spanning about 250 years. When prompted with a few words in Bengali along with a poet's name, Banalata generates poem-like output in Bengali, often following the writing style of the specified poet to some extent. Banalata also has limited instruction-following capability.

Usage

Use the Banalata—Inference (Generate Bengali Poetry) Kaggle notebook to try out Banalata. You can attach and use the latest version of the model in the notebook.

System

A GPU, e.g., a Tesla T4, is required to run Banalata. Using Unsloth is recommended because it provides faster inference.

Implementation requirements

Banalata was trained in a Kaggle notebook with a T4 GPU. Training ran for about 6 epochs and took about 6–7 hours.

The LoRA rank and alpha were both set to 8. The maximum sequence length was set to 2048. The batch size and gradient accumulation steps were both set to 4. A learning rate of 1e-3 was used.
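
The settings above can be collected in one place, as a minimal sketch. The class and field names here are hypothetical and are not taken from the training notebook. The sketch also assumes that "batch size" means the per-device batch size on the single T4, and that the standard LoRA scaling of alpha / rank applies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters reported in this model card (field names are illustrative)."""

    lora_rank: int = 8
    lora_alpha: int = 8
    max_seq_length: int = 2048
    per_device_batch_size: int = 4  # assumption: "batch size" is per device
    gradient_accumulation_steps: int = 4
    learning_rate: float = 1e-3

    @property
    def effective_batch_size(self) -> int:
        # Sequences contributing to each optimizer step on a single GPU.
        return self.per_device_batch_size * self.gradient_accumulation_steps

    @property
    def lora_scaling(self) -> float:
        # Assumption: standard LoRA, which scales the low-rank update by alpha / rank.
        return self.lora_alpha / self.lora_rank


config = FineTuneConfig()
print(config.effective_batch_size)
print(config.lora_scaling)
```

Under these assumptions, each optimizer step sees 16 sequences, and the LoRA update is applied with a scaling factor of 1.0.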

Model Characteristics

Banalata is primarily a Bengali text generation model with some instruction-following capability.

Model initialization

Banalata was created by fine-tuning Gemma 2 9B, a pre-trained model.

Model stats

The baseline Gemma model has 9B parameters.

Data Overview

The training data are in Bengali. The data used are either available in the public domain or handcrafted from scratch.

Training data

Two different datasets are used for fine-tuning:

Bengali Poems: This dataset consists of poems and songs by a dozen different Bengali poets. All these works are in the public domain. I curated this dataset by copying the poems from different websites. Its contents span the period from approximately 1718 to 1954. The geographical coverage of this dataset is primarily India and Bangladesh. Bengali Poems is a small subset of Bangla Sahitya (Bengali literature), primarily capturing poetry and songs.
Bangla Nirdeshabali: A tiny question-answer dataset in Bengali created from scratch, covering different topics, such as literature, culture, and geography. I also created this dataset, consulting various websites and contemporary information. The questions in this dataset primarily relate to poetry analysis and context-based answering, e.g., "explain this verse" or "which character is mentioned in this paragraph". There are some multiple-choice questions. There are also some "fill-in-the-blanks" questions, which aim to simulate the Masked Language Modeling effect to an extent. Some other questions pertain to tabular data interpretation, e.g., weather report summarization. The geographical coverage of this dataset is primarily India.

Both datasets are publicly available. Together, they contain about 0.6 million tokens spread across more than 850 Bengali poems.

Some of the poets whose works are covered by the two datasets include, but are not limited to:

রবীন্দ্রনাথ ঠাকুর (Rabindranath Tagore): He was the first Indian (and Asian) to win a Nobel Prize (in Literature). Tagore's poems, songs, plays, stories, and novels continue to have a profound and regular influence on the lives of Bengalis.
জীবনানন্দ দাশ (Jibanananda Das): He is often regarded as one of the greatest poets, and a pioneer, of the post-Tagore era of Bengali literature. He is perhaps also one of the first Bengali surrealist poets. His poem বনলতা সেন (Banalata Sen) is one of the greatest examples of modern Bengali poetry.
লালন ফকির (Lalan Fakir): One of the greatest baul/folk singers, he advocated humanism, blurring the lines of caste, creed, and religion. Lalan's words are still a ray of hope in today's divided world.

The motivation behind using the two datasets is that Banalata, i.e., fine-tuned Gemma, should have both text generation and instruction-following capabilities. Further details on these datasets are available later in this notebook.

Note that contemporary Bengali poems are currently copyrighted and are therefore not part of these two datasets.

Evaluation data

What was the train / test / dev split? Are there notable differences between training and test data?

Evaluation Results

Summary

Summarize and link to evaluation results for this analysis.

Fairness

N/A

Usage limitations

While Banalata generates Bengali poetry-like output, it is far from perfect. The output might, for example, contain repetitions or words that do not exist in Bengali.
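
One way to spot repetitions in generated output is a simple line-level check, sketched below. This is an illustrative heuristic, not part of Banalata or its notebooks. The function names and the sample text are made up for the example. Songs and poems can contain intentional refrains, so a flagged line is only a hint that the output deserves a closer look.

```python
from collections import Counter


def repeated_lines(poem: str, min_count: int = 2) -> dict[str, int]:
    """Return each non-empty line that occurs at least `min_count` times."""
    lines = [line.strip() for line in poem.splitlines() if line.strip()]
    counts = Counter(lines)
    return {line: n for line, n in counts.items() if n >= min_count}


def repetition_ratio(poem: str) -> float:
    """Fraction of non-empty lines that duplicate an earlier line."""
    lines = [line.strip() for line in poem.splitlines() if line.strip()]
    if not lines:
        return 0.0
    return 1.0 - len(set(lines)) / len(lines)


# Made-up sample standing in for model output: one line appears three times.
sample = "আকাশ ভরা তারা\nনদীর জলে আলো\nআকাশ ভরা তারা\nআকাশ ভরা তারা"
print(repeated_lines(sample))
print(repetition_ratio(sample))
```

A high ratio on a short output suggests the model is looping, and regenerating with different sampling settings may help.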

Ethics

N/A

barunsaha/banalata