Model description

This model is a fine-tuned version of facebook/nllb-200-distilled-600M on the Indonesian-English portion of the CoVoST2 dataset.

Intended uses & limitations

This model translates Indonesian transcriptions into English.

How to Use

This is how to use the model with CTranslate2, the inference engine that also powers Faster-Whisper.

  1. Convert the model into the CTranslate2 format with float16 quantization.

    !ct2-transformers-converter --model cobrayyxx/nllb-indo-en-covost2 --quantization float16 --output_dir ct2/ct2-nllb-indo-en-float16
    
  2. Load the converted model using the ctranslate2 library.

     import os
     import ctranslate2

     ct2_model_name = "ct2-nllb-indo-en-float16"

     ct_model_path = os.path.join("ct2", ct2_model_name)

     device = "cuda"  # or "cpu" if no GPU is available
     translator = ctranslate2.Translator(ct_model_path, device=device)
    
  3. Download the SentencePiece model

    !wget https://s3.amazonaws.com/opennmt-models/nllb-200/flores200_sacrebleu_tokenizer_spm.model
    
  4. Load the SentencePiece model

    import os
    import sentencepiece as spm

    # The tokenizer model was downloaded to the current directory in the previous step.
    sp_model_path = os.path.join(".", "flores200_sacrebleu_tokenizer_spm.model")

    sp = spm.SentencePieceProcessor()
    sp.load(sp_model_path)
    
  5. Now, the loaded model can be used.

     src_lang = "ind_Latn"
     tgt_lang = "eng_Latn"
     
     beam_size = 5
     
     source_sentences = ["Selamat pagi."]  # replace with your list of Indonesian sentences
     
     source_sents = [sent.strip() for sent in source_sentences]
     target_prefix = [[tgt_lang]] * len(source_sents)
     
     # Split the source sentences into subword pieces
     source_sents_subworded = sp.encode_as_pieces(source_sents)
     source_sents_subworded = [[src_lang] + sent + ["</s>"] for sent in source_sents_subworded]
     
     # Translate the source sentences
     translations = translator.translate_batch(source_sents_subworded,
                                               batch_type="tokens",
                                               max_batch_size=2024,
                                               beam_size=beam_size,
                                               target_prefix=target_prefix)
     translations = [translation.hypotheses[0] for translation in translations]
     
     # Merge the subwords back into target sentences
     translations_desubword = sp.decode(translations)
     translations_desubword = [sent[len(tgt_lang):].strip() for sent in translations_desubword]
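
The token framing used above can be illustrated without running the translator: each source sentence is wrapped with its language code and an end-of-sentence token, and the target language token is stripped from each decoded hypothesis. The subword pieces below are dummy placeholders, not real tokenizer output.

```python
# Toy illustration of the pre/post-processing above (no ctranslate2 needed).
src_lang, tgt_lang = "ind_Latn", "eng_Latn"

# Hypothetical pieces standing in for sp.encode_as_pieces(["Selamat pagi."])
pieces = [["▁Selamat", "▁pagi", "."]]
framed = [[src_lang] + sent + ["</s>"] for sent in pieces]
assert framed[0][0] == "ind_Latn" and framed[0][-1] == "</s>"

# After decoding, the leading target language token is stripped:
decoded = "eng_Latn Good morning."
print(decoded[len(tgt_lang):].strip())  # Good morning.
```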
    

    Note: If you get a kernel error every time you run the code above, you have to install the NVIDIA cuBLAS and cuDNN libraries.

    apt update
    apt install libcudnn9-cuda-12
    

    and install the corresponding Python packages using pip. Read the documentation for more.

    pip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.*
    
    export LD_LIBRARY_PATH=`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))'`
    

    Big shout-out to Yasmin Moslem for solving this issue.

Training procedure

Training Results

| Epoch | Training Loss | Validation Loss | BLEU      |
|-------|---------------|-----------------|-----------|
| 1     | 0.119100      | 0.048539        | 60.267190 |
| 2     | 0.020900      | 0.044844        | 59.821654 |
| 3     | 0.014600      | 0.048637        | 60.185481 |
| 4     | 0.007200      | 0.052005        | 60.150045 |
| 5     | 0.005100      | 0.054909        | 59.938441 |
| 6     | 0.004500      | 0.056668        | 60.032409 |
| 7     | 0.003800      | 0.058903        | 60.176242 |
| 8     | 0.002900      | 0.059880        | 60.168394 |
| 9     | 0.002400      | 0.060914        | 60.280851 |

Model Evaluation

The performance of the baseline and fine-tuned models was evaluated using the BLEU and chrF++ metrics on the validation dataset. The fine-tuned model shows some improvement over the baseline.

Evaluation details

  • BLEU: Measures the overlap between predicted and reference text based on n-grams.
  • chrF++: Uses character n-grams (plus word n-grams) for evaluation, making it particularly suitable for morphologically rich languages.
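
To make the character n-gram idea behind chrF concrete, here is a toy pure-Python sketch. The `chrf_score` helper is hypothetical and simplified; the actual evaluation would use an implementation such as sacreBLEU, which also handles word n-grams for chrF++ and edge-case smoothing.

```python
from collections import Counter

def char_ngrams(text, n):
    # Count character n-grams, ignoring spaces.
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf_score(hyp, ref, max_n=6, beta=2.0):
    # Average char n-gram precision and recall, combined into an F-beta score
    # (beta=2 weights recall twice as much as precision, as in chrF).
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if sum(h.values()) == 0 or sum(r.values()) == 0:
            continue
        overlap = sum((h & r).values())
        precisions.append(overlap / sum(h.values()))
        recalls.append(overlap / sum(r.values()))
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r) * 100

print(chrf_score("good morning", "good morning"))  # 100.0
```

An identical hypothesis and reference score 100, and the score falls as character-level overlap with the reference decreases.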

Credits

Huge thanks to Yasmin Moslem for mentoring me.
