|
--- |
|
library_name: transformers |
|
language: |
|
- zh |
|
license: mit |
|
base_model: microsoft/Phi-4-multimodal-instruct |
|
tags: |
|
- automatic-speech-recognition |
|
- audio |
|
- speech |
|
- generated_from_trainer |
|
datasets: |
|
- JacobLinCool/common_voice_19_0_zh-TW |
|
metrics: |
|
- wer |
|
- cer |
|
model-index: |
|
- name: Phi-4-multimodal-instruct-commonvoice-zh-tw |
|
results: |
|
- task: |
|
type: automatic-speech-recognition |
|
name: Automatic Speech Recognition |
|
dataset: |
|
name: JacobLinCool/common_voice_19_0_zh-TW |
|
type: JacobLinCool/common_voice_19_0_zh-TW |
|
metrics: |
|
- type: wer |
|
value: 31.18 |
|
name: Wer |
|
- type: cer |
|
value: 6.67 |
|
name: Cer |
|
--- |
|
|
|
# Phi-4-multimodal-instruct-commonvoice-zh-tw |
|
|
|
This model is a fine-tuned version of [microsoft/Phi-4-multimodal-instruct](https://huggingface.co/microsoft/Phi-4-multimodal-instruct) on the [Common Voice 19.0 Taiwanese Mandarin dataset](https://huggingface.co/datasets/JacobLinCool/common_voice_19_0_zh-TW). |
|
|
|
- WER: 31.18% |
|
- CER: 6.67% |
|
|
|
## Model description |
|
|
|
Phi-4-multimodal-instruct-commonvoice-zh-tw is a multimodal language model fine-tuned for Automatic Speech Recognition (ASR) of Taiwanese Mandarin (zh-TW). It was obtained by further training Microsoft's Phi-4-multimodal-instruct on a speech transcription task.
|
|
|
The model accepts audio input and produces Traditional Chinese text transcriptions. It has been specifically optimized to recognize Taiwanese Mandarin speech patterns and vocabulary. |
|
|
|
## Intended uses & limitations |
|
|
|
This model is intended for: |
|
- Transcribing spoken Taiwanese Mandarin to text |
|
- Automated subtitling/captioning for zh-TW content |
|
- Speech-to-text applications requiring Taiwanese Mandarin support |
|
|
|
Limitations: |
|
- Performance may vary with background noise, speaking speed, or accents |
|
- The model performs best with clear audio input |
|
- Specialized terminology or domain-specific vocabulary may have lower accuracy |
|
|
|
## Training and evaluation data |
|
|
|
The model was fine-tuned on the Common Voice 19.0 Taiwanese Mandarin dataset. Common Voice is a crowdsourced speech dataset built from recordings of volunteers reading sentences in many languages.
|
|
|
The evaluation was performed on the test split of the same dataset, consisting of 5,013 samples. |
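
For reference, the evaluation split can be loaded with the `datasets` library. This is a minimal sketch assuming the repository's default configuration and the usual Common Voice column names (`audio`, `sentence`), which are not verified here:

```python
from datasets import Audio, load_dataset

# Load the zh-TW test split used for evaluation (5,013 samples).
test_set = load_dataset("JacobLinCool/common_voice_19_0_zh-TW", split="test")

# Resample to 16 kHz, the rate expected by the Phi-4 audio processor.
test_set = test_set.cast_column("audio", Audio(sampling_rate=16000))

print(len(test_set))
print(test_set[0]["sentence"])  # assumed transcript column name
```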
|
|
|
## Training procedure |
|
|
|
The model was trained using LoRA adapters focused on the speech recognition components of the base model, allowing for efficient fine-tuning while preserving the general capabilities of the underlying Phi-4 model. |
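
The exact adapter configuration is not recorded in this card. The snippet below is only a generic PEFT sketch of how a LoRA adapter could be attached to the base model; the rank, alpha, and target module names are illustrative assumptions, not the settings used for this checkpoint.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-4-multimodal-instruct",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)

# Hypothetical adapter settings for illustration only.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["qkv_proj", "o_proj"],  # assumed attention projection names
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```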
|
|
|
### Prompt format |
|
|
|
This model follows the prompt template of the base Phi-4-multimodal-instruct model. For speech recognition, the audio input is provided inline with a simple instruction:
|
|
|
``` |
|
<|user|> |
|
<|audio_1|> Transcribe the audio clip into text. |
|
<|assistant|> |
|
[Transcription output in Traditional Chinese] |
|
<|end|> |
|
``` |
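
The same prompt can be built programmatically with the tokenizer's chat template, as in the sketch below (the full inference example in "How to use" builds on this):

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained(
    "JacobLinCool/Phi-4-multimodal-instruct-commonvoice-zh-tw", trust_remote_code=True
)

messages = [
    {"role": "user", "content": "<|audio_1|> Transcribe the audio clip into text."}
]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
print(prompt)  # should match the template shown above
```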
|
|
|
### Training hyperparameters |
|
|
|
The following hyperparameters were used during training: |
|
- learning_rate: 4e-05 |
|
- train_batch_size: 4 |
|
- eval_batch_size: 8 |
|
- seed: 42 |
|
- gradient_accumulation_steps: 32 |
|
- total_train_batch_size: 128 |
|
- optimizer: AdamW (torch) with betas=(0.9, 0.95) and epsilon=1e-07; no additional optimizer arguments
|
- lr_scheduler_type: linear |
|
- lr_scheduler_warmup_steps: 50 |
|
- num_epochs: 2 |
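
For reference, these values map onto a `transformers.TrainingArguments` object roughly as follows; the output directory and any options not listed above are assumptions:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="./phi4-mm-commonvoice-zh-tw",  # assumed path
    learning_rate=4e-5,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=8,
    gradient_accumulation_steps=32,  # 4 * 32 = effective batch size of 128
    num_train_epochs=2,
    lr_scheduler_type="linear",
    warmup_steps=50,
    seed=42,
    adam_beta1=0.9,
    adam_beta2=0.95,
    adam_epsilon=1e-7,
)
```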
|
|
|
### Training results |
|
|
|
The model achieved the following performance metrics on the test set: |
|
- Word Error Rate (WER): 31.18% |
|
- Character Error Rate (CER): 6.67% |
|
- Number of evaluation samples: 5,013 |
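
WER and CER can be computed with the `evaluate` library, as sketched below with placeholder strings. The exact text normalization and word segmentation behind the reported numbers are not documented here; since WER for Chinese depends heavily on segmentation, CER is usually the more informative metric.

```python
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Placeholder pairs; replace with real test-set references and model predictions.
references = ["今天 天氣 很好"]
predictions = ["今天 天氣 很好"]

print("WER:", 100 * wer_metric.compute(references=references, predictions=predictions))
print("CER:", 100 * cer_metric.compute(references=references, predictions=predictions))
```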
|
|
|
### Framework versions |
|
|
|
- Transformers 4.49.0 |
|
- Pytorch 2.4.1+cu124 |
|
- Datasets 3.3.2 |
|
- Tokenizers 0.21.1 |
|
|
|
## How to use |
|
|
|
```python |
|
import torch |
|
from transformers import AutoProcessor, AutoModelForCausalLM |
|
import librosa |
|
|
|
AUDIO_PATH = "test.wav" |
|
|
|
MODEL = "JacobLinCool/Phi-4-multimodal-instruct-commonvoice-zh-tw" |
|
DEVICE = "cuda" if torch.cuda.is_available() else "cpu" |
|
USE_FA = True  # set to False if flash-attn is not installed
|
|
|
processor = AutoProcessor.from_pretrained(MODEL, trust_remote_code=True) |
|
|
|
model = AutoModelForCausalLM.from_pretrained( |
|
MODEL, |
|
torch_dtype=torch.bfloat16 if USE_FA else torch.float32, |
|
_attn_implementation="flash_attention_2" if USE_FA else "sdpa", |
|
trust_remote_code=True, |
|
).to(DEVICE) |
|
|
|
audio, sr = librosa.load(AUDIO_PATH, sr=16000) |
|
|
|
# Prepare the user message and generate the prompt |
|
user_message = { |
|
"role": "user", |
|
"content": "<|audio_1|> Transcribe the audio clip into text.", |
|
} |
|
prompt = processor.tokenizer.apply_chat_template( |
|
[user_message], tokenize=False, add_generation_prompt=True |
|
) |
|
|
|
# Build the inputs for the model |
|
inputs = processor(text=prompt, audios=[(audio, sr)], return_tensors="pt") |
|
inputs = {k: v.to(model.device) if hasattr(v, "to") else v for k, v in inputs.items()} |
|
|
|
# Generate transcription without gradients |
|
with torch.no_grad(): |
|
generated_ids = model.generate( |
|
**inputs, |
|
eos_token_id=processor.tokenizer.eos_token_id, |
|
max_new_tokens=64, |
|
do_sample=False, |
|
) |
|
|
|
# Decode the generated token IDs into a human-readable transcription |
|
transcription = processor.decode( |
|
generated_ids[0, inputs["input_ids"].shape[1] :], |
|
skip_special_tokens=True, |
|
clean_up_tokenization_spaces=False, |
|
) |
|
|
|
# Print the transcription |
|
print(transcription) |
|
``` |
|
|