Update README.md
README.md CHANGED

---
library_name: transformers
language:
- zh
license: mit
base_model: microsoft/Phi-4-multimodal-instruct
tags:
- automatic-speech-recognition
- audio
- speech
- generated_from_trainer
datasets:
- JacobLinCool/common_voice_19_0_zh-TW
metrics:
- wer
- cer
model-index:
- name: Phi-4-multimodal-instruct-commonvoice-zh-tw
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: JacobLinCool/common_voice_19_0_zh-TW
      type: JacobLinCool/common_voice_19_0_zh-TW
    metrics:
    - type: wer
      value: 31.18
      name: Wer
    - type: cer
      value: 6.67
      name: Cer
---

# Phi-4-multimodal-instruct-commonvoice-zh-tw

This model is a fine-tuned version of [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) on the [Common Voice 19.0 Taiwanese Mandarin dataset](https://huggingface.co/datasets/JacobLinCool/common_voice_19_0_zh-TW).

- WER: 31.18%
- CER: 6.67%

## Model description

Phi-4-multimodal-instruct-commonvoice-zh-tw is a multimodal language model fine-tuned for automatic speech recognition (ASR) of Taiwanese Mandarin (zh-TW). The base model is Microsoft's Phi-4-multimodal-instruct, which was further trained on speech transcription tasks.

The model accepts audio input and produces Traditional Chinese text transcriptions. It has been specifically optimized to recognize Taiwanese Mandarin speech patterns and vocabulary.

## Intended uses & limitations

This model is intended for:
- Transcribing spoken Taiwanese Mandarin to text
- Automated subtitling/captioning for zh-TW content
- Speech-to-text applications requiring Taiwanese Mandarin support

Limitations:
- Performance may vary with background noise, speaking speed, or accents
- The model performs best with clear audio input
- Specialized terminology or domain-specific vocabulary may have lower accuracy

## Training and evaluation data

The model was fine-tuned on the Common Voice 19.0 Taiwanese Mandarin dataset. Common Voice is a crowdsourced speech dataset containing contributions from volunteers who record themselves reading sentences in various languages.

The evaluation was performed on the test split of the same dataset, consisting of 5,013 samples.

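A minimal sketch of loading that evaluation data with the `datasets` library is shown below; it assumes the repository exposes a standard `test` split with Common Voice-style `audio` and `sentence` columns, which is not spelled out in this card.

```python
from datasets import Audio, load_dataset

# Load the evaluation split (split name assumed to be "test").
test_set = load_dataset("JacobLinCool/common_voice_19_0_zh-TW", split="test")

# Decode audio at 16 kHz, matching the sampling rate used in "How to use" below.
test_set = test_set.cast_column("audio", Audio(sampling_rate=16000))

print(len(test_set))            # expected to be 5,013 according to this card
print(test_set[0]["sentence"])  # reference transcription (column name assumed)
```
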
## Training procedure

The model was trained using LoRA adapters focused on the speech recognition components of the base model, allowing for efficient fine-tuning while preserving the general capabilities of the underlying Phi-4 model.

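The exact adapter hyperparameters are not documented in this card. Purely as an illustration of the LoRA approach, a PEFT-style setup might look like the sketch below; the rank, alpha, dropout, and target modules are placeholders, not the values used for this checkpoint.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Illustrative only: rank, alpha, dropout, and target modules below are
# placeholders, not the configuration used to train this model.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

base_model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct", trust_remote_code=True
)
peft_model = get_peft_model(base_model, lora_config)
peft_model.print_trainable_parameters()  # only the adapter weights are trainable
```
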
### Prompt format

This model follows the prompt template from the original paper. For speech recognition tasks, the audio input is provided inline with a simple instruction:

```
<|user|>
<|audio_1|> Transcribe the audio clip into text.
<|assistant|>
[Transcription output in Traditional Chinese]
<|end|>
```

### Training hyperparameters

The following hyperparameters were used during training:

### Training results

The model achieved the following performance metrics on the test set:
- Word Error Rate (WER): 31.18%
- Character Error Rate (CER): 6.67%
- Number of evaluation samples: 5,013

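As a rough guide to reproducing numbers of this kind, the sketch below scores model outputs with the Hugging Face `evaluate` package. It reuses `test_set` from the data-loading sketch above, and `transcribe` is a hypothetical helper wrapping the generation code shown in "How to use" below.

```python
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

predictions, references = [], []
for example in test_set:
    # `transcribe` is a hypothetical helper that runs the "How to use"
    # generation code on a raw waveform and returns the predicted text.
    predictions.append(transcribe(example["audio"]["array"], sampling_rate=16000))
    references.append(example["sentence"])

# For Mandarin, CER is usually the more informative score, since words are
# not whitespace-delimited; WER depends on the chosen segmentation.
print("WER:", wer_metric.compute(predictions=predictions, references=references))
print("CER:", cer_metric.compute(predictions=predictions, references=references))
```
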
### Framework versions

- Pytorch 2.4.1+cu124
- Datasets 3.3.2
- Tokenizers 0.21.1

## How to use

```python
import torch
from transformers import AutoProcessor, AutoModelForCausalLM
import librosa

AUDIO_PATH = "test.wav"

MODEL = "JacobLinCool/Phi-4-multimodal-instruct-commonvoice-zh-tw"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
USE_FA = True  # set to False if flash-attn is not installed

processor = AutoProcessor.from_pretrained(MODEL, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.bfloat16 if USE_FA else torch.float32,
    _attn_implementation="flash_attention_2" if USE_FA else "sdpa",
    trust_remote_code=True,
).to(DEVICE)

# Load the audio clip at 16 kHz
audio, sr = librosa.load(AUDIO_PATH, sr=16000)

# Prepare the user message and generate the prompt
user_message = {
    "role": "user",
    "content": "<|audio_1|> Transcribe the audio clip into text.",
}
prompt = processor.tokenizer.apply_chat_template(
    [user_message], tokenize=False, add_generation_prompt=True
)

# Build the inputs for the model
inputs = processor(text=prompt, audios=[(audio, sr)], return_tensors="pt")
inputs = {k: v.to(model.device) if hasattr(v, "to") else v for k, v in inputs.items()}

# Generate the transcription without gradients
with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        eos_token_id=processor.tokenizer.eos_token_id,
        max_new_tokens=64,
        do_sample=False,
    )

# Decode only the newly generated tokens into a human-readable transcription
transcription = processor.decode(
    generated_ids[0, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)

# Print the transcription
print(transcription)
```