|
--- |
|
library_name: transformers |
|
language: |
|
- zh |
|
license: mit |
|
base_model: microsoft/Phi-4-multimodal-instruct |
|
tags: |
|
- automatic-speech-recognition |
|
- audio |
|
- speech |
|
- generated_from_trainer |
|
datasets: |
|
- JacobLinCool/common_voice_19_0_zh-TW |
|
metrics: |
|
- wer |
|
- cer |
|
model-index: |
|
- name: Phi-4-multimodal-instruct-commonvoice-zh-tw |
|
results: |
|
- task: |
|
type: automatic-speech-recognition |
|
name: Automatic Speech Recognition |
|
dataset: |
|
name: JacobLinCool/common_voice_19_0_zh-TW |
|
type: JacobLinCool/common_voice_19_0_zh-TW |
|
metrics: |
|
- type: wer |
|
value: 31.18 |
|
name: Wer |
|
- type: cer |
|
value: 6.67 |
|
name: Cer |
|
--- |
|
|
|
# Phi-4-multimodal-instruct-commonvoice-zh-tw |
|
|
|
This model is a fine-tuned version of [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) on the [Common Voice 19.0 Taiwanese Mandarin dataset](https://huggingface.co/datasets/JacobLinCool/common_voice_19_0_zh-TW). |
|
|
|
- WER: 31.18% |
|
- CER: 6.67% |
|
|
|
## Model description |
|
|
|
Phi-4-multimodal-instruct-commonvoice-zh-tw is a multimodal language model fine-tuned for Automatic Speech Recognition (ASR) of Taiwanese Mandarin (zh-TW). It was obtained by further training Microsoft's Phi-4-multimodal-instruct on a speech transcription task.
|
|
|
The model accepts audio input and produces Traditional Chinese text transcriptions. It has been specifically optimized to recognize Taiwanese Mandarin speech patterns and vocabulary. |
|
|
|
## Intended uses & limitations |
|
|
|
This model is intended for: |
|
- Transcribing spoken Taiwanese Mandarin to text |
|
- Automated subtitling/captioning for zh-TW content |
|
- Speech-to-text applications requiring Taiwanese Mandarin support |
|
|
|
Limitations: |
|
- Performance may vary with background noise, speaking speed, or accents |
|
- The model performs best with clear audio input |
|
- Specialized terminology or domain-specific vocabulary may have lower accuracy |
|
|
|
## Training and evaluation data |
|
|
|
The model was fine-tuned on the Common Voice 19.0 Taiwanese Mandarin dataset. Common Voice is a crowdsourced speech dataset built from recordings of volunteers reading sentences in many languages.
|
|
|
The evaluation was performed on the test split of the same dataset, consisting of 5,013 samples. |
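
For reference, the evaluation split can be loaded with the `datasets` library. This is a minimal sketch assuming the repository's default configuration and the usual Common Voice column names (`audio`, `sentence`), which are not verified here:

```python
from datasets import Audio, load_dataset

# Load the zh-TW test split used for evaluation (5,013 samples).
test_set = load_dataset("JacobLinCool/common_voice_19_0_zh-TW", split="test")

# Resample to 16 kHz, the rate expected by the Phi-4 audio processor.
test_set = test_set.cast_column("audio", Audio(sampling_rate=16000))

print(len(test_set))
print(test_set[0]["sentence"])  # assumed transcript column name
```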
|
|
|
## Training procedure |
|
|
|
The model was trained using LoRA adapters focused on the speech recognition components of the base model, allowing for efficient fine-tuning while preserving the general capabilities of the underlying Phi-4 model. |
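
The exact adapter configuration is not recorded in this card. The snippet below is only a generic PEFT sketch of how a LoRA adapter could be attached to the base model; the rank, alpha, and target module names are illustrative assumptions, not the settings used for this checkpoint.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Hypothetical adapter settings for illustration only.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["qkv_proj", "o_proj"],  # assumed attention projection names
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```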
|
|
|
### Prompt format |
|
|
|
This model follows the prompt template of the base Phi-4-multimodal-instruct model. For speech recognition, the audio input is provided inline with a simple instruction:
|
|
|
``` |
|
<|user|> |
|
<|audio_1|> Transcribe the audio clip into text. |
|
<|assistant|> |
|
[Transcription output in Traditional Chinese] |
|
<|end|> |
|
``` |
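
The same prompt can be built programmatically with the tokenizer's chat template, as in the sketch below (the full inference example in "How to use" builds on this):

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "JacobLinCool/Phi-4-multimodal-instruct-commonvoice-zh-tw", trust_remote_code=True
)

messages = [
    {"role": "user", "content": "<|audio_1|> Transcribe the audio clip into text."}
]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # should match the template shown above
```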
|
|
|
### Training hyperparameters |
|
|
|
The following hyperparameters were used during training: |
|
- learning_rate: 4e-05 |
|
- train_batch_size: 4 |
|
- eval_batch_size: 8 |
|
- seed: 42 |
|
- gradient_accumulation_steps: 32 |
|
- total_train_batch_size: 128 |
|
- optimizer: AdamW (torch) with betas=(0.9, 0.95) and epsilon=1e-07; no additional optimizer arguments
|
- lr_scheduler_type: linear |
|
- lr_scheduler_warmup_steps: 50 |
|
- num_epochs: 2 |
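
For reference, these values map onto a `transformers.TrainingArguments` object roughly as follows; the output directory and any options not listed above are assumptions:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./phi4-mm-commonvoice-zh-tw",  # assumed path
    learning_rate=4e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=32,  # 4 * 32 = effective batch size of 128
    num_train_epochs=2,
    lr_scheduler_type="linear",
    warmup_steps=50,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.95,
    adam_epsilon=1e-7,
)
```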
|
|
|
### Training results |
|
|
|
The model achieved the following performance metrics on the test set: |
|
- Word Error Rate (WER): 31.18% |
|
- Character Error Rate (CER): 6.67% |
|
- Number of evaluation samples: 5,013 |
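
WER and CER can be computed with the `evaluate` library, as sketched below with placeholder strings. The exact text normalization and word segmentation behind the reported numbers are not documented here; since WER for Chinese depends heavily on segmentation, CER is usually the more informative metric.

```python
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Placeholder pairs; replace with real test-set references and model predictions.
references = ["今天 天氣 很好"]
predictions = ["今天 天氣 很好"]

print("WER:", 100 * wer_metric.compute(references=references, predictions=predictions))
print("CER:", 100 * cer_metric.compute(references=references, predictions=predictions))
```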
|
|
|
### Framework versions |
|
|
|
- Transformers 4.49.0 |
|
- Pytorch 2.4.1+cu124 |
|
- Datasets 3.3.2 |
|
- Tokenizers 0.21.1 |
|
|
|
## How to use |
|
|
|
```python |
|
import torch |
|
from transformers import AutoProcessor, AutoModelForCausalLM |
|
import librosa |
|
|
|
AUDIO_PATH = "test.wav" |
|
|
|
MODEL = "JacobLinCool/Phi-4-multimodal-instruct-commonvoice-zh-tw" |
|
DEVICE = "cuda" if torch.cuda.is_available() else "cpu" |
|
USE_FA = True  # set to False if flash-attn is not installed
|
|
|
processor = AutoProcessor.from_pretrained(MODEL, trust_remote_code=True) |
|
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
MODEL, |
|
torch_dtype=torch.bfloat16 if USE_FA else torch.float32, |
|
_attn_implementation="flash_attention_2" if USE_FA else "sdpa", |
|
trust_remote_code=True, |
|
).to(DEVICE) |
|
|
|
audio, sr = librosa.load(AUDIO_PATH, sr=16000) |
|
|
|
# Prepare the user message and generate the prompt |
|
user_message = { |
|
"role": "user", |
|
"content": "<|audio_1|> Transcribe the audio clip into text.", |
|
} |
|
prompt = processor.tokenizer.apply_chat_template( |
|
[user_message], tokenize=False, add_generation_prompt=True |
|
) |
|
|
|
# Build the inputs for the model |
|
inputs = processor(text=prompt, audios=[(audio, sr)], return_tensors="pt") |
|
inputs = {k: v.to(model.device) if hasattr(v, "to") else v for k, v in inputs.items()} |
|
|
|
# Generate transcription without gradients |
|
with torch.no_grad(): |
|
generated_ids = model.generate( |
|
**inputs, |
|
eos_token_id=processor.tokenizer.eos_token_id, |
|
max_new_tokens=64, |
|
do_sample=False, |
|
) |
|
|
|
# Decode the generated token IDs into a human-readable transcription |
|
transcription = processor.decode( |
|
generated_ids[0, inputs["input_ids"].shape[1] :], |
|
skip_special_tokens=True, |
|
clean_up_tokenization_spaces=False, |
|
) |
|
|
|
# Print the transcription |
|
print(transcription) |
|
``` |
|
|