Is the Conversion Process Purely a Format Conversion Without Additional Training?

#5 opened by hwanython

I would like to confirm that during the conversion of the deepdml/whisper-large-v3-turbo model to the CTranslate2 format, no additional training or fine-tuning is performed; the process is strictly a conversion of the existing model weights into a format optimized for inference.

Exactly, it is strictly a conversion of the existing model weights into a format optimized for inference; no additional training or fine-tuning is involved.
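For reference, a minimal sketch of what such a weights-only conversion typically looks like with the CTranslate2 Transformers converter; the source model name, copied files, output directory, and quantization choice below are illustrative assumptions, not the exact settings used for this repo:

```python
# Sketch of a weights-only conversion to the CTranslate2 format.
# No training happens here; the converter only re-serializes the existing weights.
import ctranslate2

converter = ctranslate2.converters.TransformersConverter(
    "openai/whisper-large-v3-turbo",  # source Transformers checkpoint (illustrative)
    copy_files=["tokenizer.json", "preprocessor_config.json"],
)
converter.convert(
    "whisper-large-v3-turbo-ct2",     # output directory (illustrative)
    quantization="float16",           # optional weight quantization for inference
)
```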

Thank you!!

I am very interested in the Whisper model. Can I ask something?

Are the Repeated Filler Outputs a Result of a Conversion or Decoding Issue?

"I am using whisper-large-v3-turbo to transcribe audio, and I encountered an issue where a short utterance (just a brief 'μ–΄' sound) is transcribed as an excessively long repeated sequence of filler words (e.g., repeated 'λ„€'). I would like to know what the general causes for this phenomenon might be and what alternatives or adjustments can be made to mitigate it. Specifically, is this behavior related to the decoding parameters (such as temperature or beam search settings), audio segmentation issues, or model biases in handling short utterances?

Hallucinations are a known behavior of Whisper. I recommend searching for it in the Whisper repo: https://github.com/openai/whisper/discussions?discussions_q=hallucination

Some solutions: use a VAD, tune decoding parameters such as temperature and beam size, and with the transformers library it sometimes helps to use return_timestamps="word".
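As a rough illustration of those mitigations, here is a minimal sketch using faster-whisper, which runs CTranslate2 Whisper models; the model path, audio file, and threshold values are placeholder assumptions, not verified settings:

```python
# Hedged sketch: reducing repeated/hallucinated fillers on short utterances
# with faster-whisper (CTranslate2 runtime). Paths and values are examples only.
from faster_whisper import WhisperModel

model = WhisperModel(
    "whisper-large-v3-turbo-ct2",      # path to the converted model (placeholder)
    device="cuda",
    compute_type="float16",
)

segments, info = model.transcribe(
    "short_utterance.wav",             # placeholder audio file
    language="ko",
    beam_size=5,                       # beam search instead of pure greedy decoding
    temperature=0.0,                   # start deterministic; fallback can raise it
    vad_filter=True,                   # drop silence so near-empty audio isn't decoded
    condition_on_previous_text=False,  # keeps repetition from feeding on itself
    no_speech_threshold=0.6,           # skip segments that are likely not speech
)

for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
```

With the transformers library, the analogous tweak mentioned above is passing return_timestamps="word" to the ASR pipeline, which in some cases keeps the decoder from looping on filler tokens.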
