Model description

This model is a fine-tuned version of facebook/nllb-200-distilled-600M on the Indonesian-English portion of the CoVoST2 dataset.

Intended uses & limitations

This model translates Indonesian transcriptions into English.

How to Use

This is how to use the model with CTranslate2, the inference engine that also powers Faster-Whisper.

  1. Convert the model into the CTranslate2 format with float16 quantization.

    !ct2-transformers-converter --model cobrayyxx/nllb-indo-en-covost2 --quantization float16 --output_dir ct2/ct2-nllb-indo-en-float16
    
  2. Load the converted model using the ctranslate2 library.

     import os
     import ctranslate2

     ct2_model_name = "ct2-nllb-indo-en-float16"

     ct_model_path = os.path.join("ct2", ct2_model_name)

     device = "cuda"  # or "cpu" if no GPU is available
     translator = ctranslate2.Translator(ct_model_path, device=device)
    
  3. Download the SentencePiece model

    !wget https://s3.amazonaws.com/opennmt-models/nllb-200/flores200_sacrebleu_tokenizer_spm.model
    
  4. Load the SentencePiece model

    import os
    import sentencepiece as spm

    # The tokenizer model was downloaded to the current directory in the previous step.
    sp_model_path = os.path.join(".", "flores200_sacrebleu_tokenizer_spm.model")

    sp = spm.SentencePieceProcessor()
    sp.load(sp_model_path)
    
  5. Now, the loaded model can be used.

     src_lang = "ind_Latn"
     tgt_lang = "eng_Latn"
     
     beam_size = 5
     
     source_sentences = ["Selamat pagi."]  # replace with your list of Indonesian sentences
     
     source_sents = [sent.strip() for sent in source_sentences]
     target_prefix = [[tgt_lang]] * len(source_sents)
     
     # Split the source sentences into subword pieces
     source_sents_subworded = sp.encode_as_pieces(source_sents)
     source_sents_subworded = [[src_lang] + sent + ["</s>"] for sent in source_sents_subworded]
     
     # Translate the source sentences
     translations = translator.translate_batch(source_sents_subworded,
                                               batch_type="tokens",
                                               max_batch_size=2024,
                                               beam_size=beam_size,
                                               target_prefix=target_prefix)
     translations = [translation.hypotheses[0] for translation in translations]
     
     # Merge the subwords back into target sentences
     translations_desubword = sp.decode(translations)
     translations_desubword = [sent[len(tgt_lang):].strip() for sent in translations_desubword]
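
The token framing used above can be illustrated without running the translator: each source sentence is wrapped with its language code and an end-of-sentence token, and the target language token is stripped from each decoded hypothesis. The subword pieces below are dummy placeholders, not real tokenizer output.

```python
# Toy illustration of the pre/post-processing above (no ctranslate2 needed).
src_lang, tgt_lang = "ind_Latn", "eng_Latn"

# Hypothetical pieces standing in for sp.encode_as_pieces(["Selamat pagi."])
pieces = [["▁Selamat", "▁pagi", "."]]
framed = [[src_lang] + sent + ["</s>"] for sent in pieces]
assert framed[0][0] == "ind_Latn" and framed[0][-1] == "</s>"

# After decoding, the leading target language token is stripped:
decoded = "eng_Latn Good morning."
print(decoded[len(tgt_lang):].strip())  # Good morning.
```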
    

    Note: If you get a kernel error every time you run the code above, you have to install the NVIDIA cuBLAS and cuDNN libraries.

    apt update
    apt install libcudnn9-cuda-12
    

    and install the corresponding Python packages using pip. Read the documentation for more.

    pip install nvidia-cublas-cu12 nvidia-cudnn-cu12==9.*
    
    export LD_LIBRARY_PATH=`python3 -c 'import os; import nvidia.cublas.lib; import nvidia.cudnn.lib; print(os.path.dirname(nvidia.cublas.lib.__file__) + ":" + os.path.dirname(nvidia.cudnn.lib.__file__))'`
    

    Big shout-out to Yasmin Moslem for solving this issue.

Training procedure

Training Results

| Epoch | Training Loss | Validation Loss | BLEU      |
|-------|---------------|-----------------|-----------|
| 1     | 0.119100      | 0.048539        | 60.267190 |
| 2     | 0.020900      | 0.044844        | 59.821654 |
| 3     | 0.014600      | 0.048637        | 60.185481 |
| 4     | 0.007200      | 0.052005        | 60.150045 |
| 5     | 0.005100      | 0.054909        | 59.938441 |
| 6     | 0.004500      | 0.056668        | 60.032409 |
| 7     | 0.003800      | 0.058903        | 60.176242 |
| 8     | 0.002900      | 0.059880        | 60.168394 |
| 9     | 0.002400      | 0.060914        | 60.280851 |

Model Evaluation

The performance of the baseline and fine-tuned models was evaluated using the BLEU and chrF++ metrics on the validation dataset. The fine-tuned model shows some improvement over the baseline.

Evaluation details

  • BLEU: Measures the overlap between predicted and reference text based on n-grams.
  • chrF++: Uses character n-grams (plus word n-grams) for evaluation, making it particularly suitable for morphologically rich languages.
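
To make the character n-gram idea behind chrF concrete, here is a toy pure-Python sketch. The `chrf_score` helper is hypothetical and simplified; the actual evaluation would use an implementation such as sacreBLEU, which also handles word n-grams for chrF++ and edge-case smoothing.

```python
from collections import Counter

def char_ngrams(text, n):
    # Count character n-grams, ignoring spaces.
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf_score(hyp, ref, max_n=6, beta=2.0):
    # Average char n-gram precision and recall, combined into an F-beta score
    # (beta=2 weights recall twice as much as precision, as in chrF).
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        h, r = char_ngrams(hyp, n), char_ngrams(ref, n)
        if sum(h.values()) == 0 or sum(r.values()) == 0:
            continue
        overlap = sum((h & r).values())
        precisions.append(overlap / sum(h.values()))
        recalls.append(overlap / sum(r.values()))
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r) * 100

print(chrf_score("good morning", "good morning"))  # 100.0
```

An identical hypothesis and reference score 100, and the score falls as character-level overlap with the reference decreases.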

Credits

Huge thanks to Yasmin Moslem for mentoring me.
