Update readme
README.md CHANGED

@@ -145,6 +145,8 @@ With Phi-4-multimodal-instruct, a single new open model has been trained across
It is anticipated that Phi-4-multimodal-instruct will greatly benefit app developers and various use cases. The enthusiastic support for the Phi-4 series is greatly appreciated. Feedback on Phi-4 is welcomed and crucial to the model's evolution and improvement. Thank you for being part of this journey!

## Model Quality
<details>
<summary>Click to view details</summary>

To understand its capabilities, Phi-4-multimodal-instruct was compared with a set of models over a variety of benchmarks using an internal benchmark platform (see Appendix A for benchmark methodology). Users can refer to the Phi-4-Mini-Instruct model card for details of the language benchmarks. At a high level, model quality on representative speech and vision benchmarks is as follows:

@@ -262,6 +264,7 @@ BLINK is an aggregated benchmark with 14 visual tasks that humans can solve very

![phi-4](https://huggingface.co/microsoft/Phi-4-multimodal-instruct/resolve/main/figures/vision_radar.png)

</details>

## Usage

@@ -474,6 +477,23 @@ print(f'>>> Response\n{response}')

More inference examples can be found [**here**](https://huggingface.co/microsoft/Phi-4-multimodal-instruct/blob/main/sample_inference_phi4mm.py).

### vLLM inference

A vLLM server can be started with the following command:

```bash
python -m vllm.entrypoints.openai.api_server --model 'microsoft/Phi-4-multimodal-instruct' --dtype auto --trust-remote-code --max-model-len 131072 --enable-lora --max-lora-rank 320 --lora-extra-vocab-size 0 --limit-mm-per-prompt audio=3,image=3 --max-loras 2 --lora-modules speech=<path to speech lora folder> vision=<path to vision lora folder>
```
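
Once the server is running, it exposes an OpenAI-compatible API. The snippet below is a minimal sketch under a few assumptions: the server listens on the default `http://localhost:8000/v1` endpoint, the `openai` Python package is installed, the LoRA modules registered via `--lora-modules` are selected by passing their names (`speech` or `vision`) as the model, and the image URL is only a placeholder.

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="vision",  # name of the LoRA module registered via --lora-modules
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What is shown in this image?"},
                # Placeholder URL: replace with a reachable image.
                {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
            ],
        }
    ],
    max_tokens=128,
)
print(response.choices[0].message.content)
```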

The speech LoRA and vision LoRA folders are inside the Phi-4-multimodal-instruct folder downloaded by vLLM; you can also use the following script to find them:

```python
from huggingface_hub import snapshot_download

model_path = snapshot_download(repo_id="microsoft/Phi-4-multimodal-instruct")
speech_lora_path = model_path + "/speech-lora"
vision_lora_path = model_path + "/vision-lora"
```
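
For example (a minimal sketch, assuming the same snapshot layout as above), the resolved paths can be dropped straight into the `--lora-modules` placeholders of the server command; the snippet simply prints the assembled launch command:

```python
from huggingface_hub import snapshot_download

# Resolve the local snapshot, as in the snippet above.
model_path = snapshot_download(repo_id="microsoft/Phi-4-multimodal-instruct")

# Fill the resolved LoRA paths into the <path to ...> placeholders of the server command.
launch_command = (
    "python -m vllm.entrypoints.openai.api_server "
    "--model 'microsoft/Phi-4-multimodal-instruct' --dtype auto --trust-remote-code "
    "--max-model-len 131072 --enable-lora --max-lora-rank 320 --lora-extra-vocab-size 0 "
    "--limit-mm-per-prompt audio=3,image=3 --max-loras 2 "
    f"--lora-modules speech={model_path}/speech-lora vision={model_path}/vision-lora"
)
print(launch_command)
```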

## Training

### Fine-tuning