JacobLinCool committed
Commit 61ee751 · verified · 1 Parent(s): ffe3a0f

Update README.md

Files changed (1):
  1. README.md +118 -9
README.md CHANGED
@@ -1,35 +1,84 @@
 ---
 library_name: transformers
 license: mit
 base_model: microsoft/Phi-4-multimodal-instruct
 tags:
 - generated_from_trainer
 model-index:
 - name: Phi-4-multimodal-instruct-commonvoice-zh-tw
- results: []
 ---

- <!-- This model card has been generated automatically according to the information the Trainer had access to. You
- should probably proofread and complete it, then remove this comment. -->
-
 # Phi-4-multimodal-instruct-commonvoice-zh-tw

- This model is a fine-tuned version of [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) on an unknown dataset.

 ## Model description

- More information needed

 ## Intended uses & limitations

- More information needed

 ## Training and evaluation data

- More information needed

 ## Training procedure

 ### Training hyperparameters

 The following hyperparameters were used during training:
@@ -46,7 +95,10 @@ The following hyperparameters were used during training:

 ### Training results

-

 ### Framework versions

@@ -54,3 +106,60 @@
 - Pytorch 2.4.1+cu124
 - Datasets 3.3.2
 - Tokenizers 0.21.1
The updated README.md:

---
library_name: transformers
language:
- zh
license: mit
base_model: microsoft/Phi-4-multimodal-instruct
tags:
- automatic-speech-recognition
- audio
- speech
- generated_from_trainer
datasets:
- JacobLinCool/common_voice_19_0_zh-TW
metrics:
- wer
- cer
model-index:
- name: Phi-4-multimodal-instruct-commonvoice-zh-tw
  results:
  - task:
      type: automatic-speech-recognition
      name: Automatic Speech Recognition
    dataset:
      name: JacobLinCool/common_voice_19_0_zh-TW
      type: JacobLinCool/common_voice_19_0_zh-TW
    metrics:
    - type: wer
      value: 31.18
      name: Wer
    - type: cer
      value: 6.67
      name: Cer
---

# Phi-4-multimodal-instruct-commonvoice-zh-tw

This model is a fine-tuned version of [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) on the [Common Voice 19.0 Taiwanese Mandarin dataset](https://huggingface.co/datasets/JacobLinCool/common_voice_19_0_zh-TW). On the test split it achieves:

- WER: 31.18%
- CER: 6.67%

## Model description

Phi-4-multimodal-instruct-commonvoice-zh-tw is a multimodal language model fine-tuned for automatic speech recognition (ASR) of Taiwanese Mandarin (zh-TW). The base model, Microsoft's Phi-4-multimodal-instruct, was further trained on speech transcription data.

The model accepts audio input and produces Traditional Chinese text transcriptions. It has been specifically optimized to recognize Taiwanese Mandarin speech patterns and vocabulary.
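The usage example at the end of this card loads audio with librosa at 16 kHz. If your recordings are stereo or sampled at a different rate, a small normalization step like the following sketch can bring them into that shape first (the `input.mp3` path and the use of `soundfile` for saving are illustrative choices, not something the card prescribes):

```python
import librosa
import soundfile as sf

# Hypothetical input file: librosa decodes it, downmixes to mono,
# and resamples to the 16 kHz rate used in the usage example below.
audio, sr = librosa.load("input.mp3", sr=16000, mono=True)

# Optionally persist a normalized copy for later transcription runs.
sf.write("input_16k.wav", audio, sr)
print(f"{len(audio) / sr:.1f} seconds of mono audio at {sr} Hz")
```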
## Intended uses & limitations

This model is intended for:

- Transcribing spoken Taiwanese Mandarin to text
- Automated subtitling/captioning for zh-TW content
- Speech-to-text applications requiring Taiwanese Mandarin support

Limitations:

- Performance may vary with background noise, speaking speed, or accents
- The model performs best with clear audio input
- Specialized terminology or domain-specific vocabulary may have lower accuracy

## Training and evaluation data

The model was fine-tuned on the Common Voice 19.0 Taiwanese Mandarin dataset. Common Voice is a crowdsourced speech dataset of volunteers recording themselves reading sentences in many languages.

The evaluation was performed on the test split of the same dataset, consisting of 5,013 samples.
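For reference, the evaluation data can be pulled directly from the Hub with the `datasets` library. This is a minimal sketch; the `audio` and `sentence` column names are assumed to follow the usual Common Voice layout and are not documented in this card:

```python
from datasets import load_dataset, Audio

# Load the test split used for evaluation (5,013 samples according to this card).
ds = load_dataset("JacobLinCool/common_voice_19_0_zh-TW", split="test")

# Decode audio at 16 kHz to match the model's expected input rate.
ds = ds.cast_column("audio", Audio(sampling_rate=16000))

example = ds[0]
print(example["sentence"])  # reference transcript (assumed column name)
print(example["audio"]["array"].shape, example["audio"]["sampling_rate"])
```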
## Training procedure

The model was trained using LoRA adapters focused on the speech recognition components of the base model, allowing for efficient fine-tuning while preserving the general capabilities of the underlying Phi-4 model.
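The card does not list the adapter configuration itself. As a reminder of what a LoRA adapter does, here is a minimal from-scratch sketch of a LoRA-wrapped linear layer; the rank and scaling values are placeholders for illustration, not the settings used for this model:

```python
import torch.nn as nn

class LoRALinear(nn.Module):
    """A frozen base linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # the base weights stay frozen
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # the update starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))

layer = LoRALinear(nn.Linear(512, 512), r=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable parameters: {trainable} / {total}")  # only the low-rank factors train
```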
### Prompt format

This model follows the prompt template from the original paper. For speech recognition tasks, the audio input is provided inline with a simple instruction:

```
<|user|>
<|audio_1|> Transcribe the audio clip into text.
<|assistant|>
[Transcription output in Traditional Chinese]
<|end|>
```
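Equivalently, the prompt string can be assembled directly from the special tokens, mirroring the layout documented for the base model (the chat-template call shown in the usage example below should produce an equivalent prompt):

```python
# Assemble the ASR prompt from Phi-4-multimodal's special tokens.
user, assistant, end = "<|user|>", "<|assistant|>", "<|end|>"
prompt = f"{user}<|audio_1|> Transcribe the audio clip into text.{end}{assistant}"
print(prompt)
```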
### Training hyperparameters

The following hyperparameters were used during training:

### Training results

The model achieved the following performance metrics on the test set:

- Word Error Rate (WER): 31.18%
- Character Error Rate (CER): 6.67%
- Number of evaluation samples: 5,013
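WER and CER measure the edit distance between hypothesis and reference at the word and character level, respectively. The card does not say which tool produced the reported numbers; as one option, the `jiwer` package computes both (a toy example with made-up strings):

```python
import jiwer

# Toy reference/hypothesis pair, purely to show the API.
references = ["今天天氣很好"]
hypotheses = ["今天天氣真好"]

# For a character-based language such as Mandarin, CER is usually the more
# informative number; WER additionally depends on how text is split into words.
print("CER:", jiwer.cer(references, hypotheses))
print("WER:", jiwer.wer(references, hypotheses))
```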
### Framework versions

- Pytorch 2.4.1+cu124
- Datasets 3.3.2
- Tokenizers 0.21.1

## How to use
```python
import torch
from transformers import AutoProcessor, AutoModelForCausalLM
import librosa

AUDIO_PATH = "test.wav"

MODEL = "JacobLinCool/Phi-4-multimodal-instruct-commonvoice-zh-tw"
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
USE_FA = True  # requires flash-attn; set to False to fall back to SDPA in float32

processor = AutoProcessor.from_pretrained(MODEL, trust_remote_code=True)

model = AutoModelForCausalLM.from_pretrained(
    MODEL,
    torch_dtype=torch.bfloat16 if USE_FA else torch.float32,
    _attn_implementation="flash_attention_2" if USE_FA else "sdpa",
    trust_remote_code=True,
).to(DEVICE)

# Load the audio and resample it to 16 kHz mono
audio, sr = librosa.load(AUDIO_PATH, sr=16000)

# Prepare the user message and generate the prompt
user_message = {
    "role": "user",
    "content": "<|audio_1|> Transcribe the audio clip into text.",
}
prompt = processor.tokenizer.apply_chat_template(
    [user_message], tokenize=False, add_generation_prompt=True
)

# Build the inputs for the model
inputs = processor(text=prompt, audios=[(audio, sr)], return_tensors="pt")
inputs = {k: v.to(model.device) if hasattr(v, "to") else v for k, v in inputs.items()}

# Generate the transcription without gradients
with torch.no_grad():
    generated_ids = model.generate(
        **inputs,
        eos_token_id=processor.tokenizer.eos_token_id,
        max_new_tokens=64,
        do_sample=False,
    )

# Decode only the newly generated tokens into a human-readable transcription
transcription = processor.decode(
    generated_ids[0, inputs["input_ids"].shape[1] :],
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False,
)

# Print the transcription
print(transcription)
```
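To sanity-check the reported numbers, the loop below sketches an evaluation over the Common Voice test split, reusing the `model` and `processor` loaded above. It is an approximation: the `sentence` column name, the subsampling, and the absence of any text normalization are assumptions, since the card does not describe the exact evaluation script.

```python
import torch
import jiwer
from datasets import load_dataset, Audio

ds = load_dataset("JacobLinCool/common_voice_19_0_zh-TW", split="test")
ds = ds.cast_column("audio", Audio(sampling_rate=16000))

def transcribe(audio_array, sampling_rate):
    # Same prompt construction and generation settings as the usage example above.
    message = {"role": "user", "content": "<|audio_1|> Transcribe the audio clip into text."}
    prompt = processor.tokenizer.apply_chat_template(
        [message], tokenize=False, add_generation_prompt=True
    )
    batch = processor(text=prompt, audios=[(audio_array, sampling_rate)], return_tensors="pt")
    batch = {k: v.to(model.device) if hasattr(v, "to") else v for k, v in batch.items()}
    with torch.no_grad():
        out = model.generate(
            **batch,
            eos_token_id=processor.tokenizer.eos_token_id,
            max_new_tokens=64,
            do_sample=False,
        )
    return processor.decode(out[0, batch["input_ids"].shape[1]:], skip_special_tokens=True)

refs, hyps = [], []
for ex in ds.select(range(100)):  # subsample for a quick check; run the full split for exact numbers
    refs.append(ex["sentence"])
    hyps.append(transcribe(ex["audio"]["array"], ex["audio"]["sampling_rate"]))

print("CER:", jiwer.cer(refs, hyps))
print("WER:", jiwer.wer(refs, hyps))
```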