---
tags:
- FP8
- vllm
- audio
license: apache-2.0
license_link: https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md
language:
- en
base_model: openai/whisper-tiny
library_name: transformers
---

# whisper-tiny-FP8-Dynamic

## Model Overview
- **Model Architecture:** whisper-tiny
- **Input:** Audio-Text
- **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** FP8
  - **Activation quantization:** FP8
- **Release Date:** 04/16/2025
- **Version:** 1.0
- **Model Developers:** Neural Magic

Quantized version of [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny).

### Model Optimizations

This model was obtained by quantizing the weights and activations of [openai/whisper-tiny](https://huggingface.co/openai/whisper-tiny) to the FP8 data type, ready for inference with vLLM >= 0.5.2.

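For intuition, the sketch below shows in simplified, per-tensor form what FP8-dynamic quantization does: each weight tensor gets a scale computed once offline, while activation scales are recomputed for every input at inference time. This is an illustrative approximation only, not the llm-compressor or vLLM implementation, which typically uses finer-grained (per-channel / per-token) scales and fused FP8 kernels.

```python
# Illustrative sketch only (per-tensor scales for simplicity); the shipped
# FP8_DYNAMIC scheme typically uses per-channel weight and per-token
# activation scales, and vLLM runs fused FP8 kernels instead of this.
import torch

FP8_MAX = torch.finfo(torch.float8_e4m3fn).max  # 448.0 for E4M3

def to_fp8(x: torch.Tensor):
    """Scale x so its max magnitude fits the FP8 range, then cast."""
    scale = x.abs().max().clamp(min=1e-12) / FP8_MAX
    return (x / scale).clamp(-FP8_MAX, FP8_MAX).to(torch.float8_e4m3fn), scale

w = torch.randn(384, 384)        # a whisper-tiny-sized Linear weight
w_fp8, w_scale = to_fp8(w)       # computed once, offline ("static" weights)

a = torch.randn(1, 384)          # an activation row
a_fp8, a_scale = to_fp8(a)       # recomputed per input ("dynamic" activations)

# Dequantize-and-matmul reference; real kernels fold the scales into the GEMM.
y = (a_fp8.float() * a_scale) @ (w_fp8.float() * w_scale).T
```

Because activation scales are computed on the fly, no calibration dataset is needed, which is why the creation recipe further below runs `oneshot` without one.
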
## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm.assets.audio import AudioAsset
from vllm import LLM, SamplingParams

# prepare model
llm = LLM(
    model="neuralmagic/whisper-tiny-FP8-Dynamic",
    max_model_len=448,
    max_num_seqs=400,
    limit_mm_per_prompt={"audio": 1},
)

# prepare inputs
inputs = {  # Test explicit encoder/decoder prompt
    "encoder_prompt": {
        "prompt": "",
        "multi_modal_data": {
            "audio": AudioAsset("winning_call").audio_and_sample_rate,
        },
    },
    "decoder_prompt": "<|startoftranscript|>",
}

# generate response
print("========== SAMPLE GENERATION ==============")
outputs = llm.generate(inputs, SamplingParams(temperature=0.0, max_tokens=64))
print(f"PROMPT  : {outputs[0].prompt}")
print(f"RESPONSE: {outputs[0].outputs[0].text}")
print("==========================================")
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.

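Depending on your vLLM version, Whisper-family models can also be served behind vLLM's OpenAI-compatible audio transcription endpoint. A hedged sketch, assuming a server started with `vllm serve neuralmagic/whisper-tiny-FP8-Dynamic` on the default port and a local `sample.wav` file (both the exact flags and endpoint support may differ across releases):

```python
# Sketch of calling a locally served model through the OpenAI client.
# Assumes vLLM's /v1/audio/transcriptions endpoint is available for Whisper
# in your installed version; adjust the base_url/port to your server.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("sample.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="neuralmagic/whisper-tiny-FP8-Dynamic",
        file=audio_file,
    )

print(transcription.text)
```
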
## Creation

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below.

<details>
<summary>Model Creation Code</summary>

```bash
python quantize.py \
    --model_path openai/whisper-tiny \
    --quant_path output_dir/whisper-tiny-FP8-Dynamic
```

```python
import argparse
import os

from transformers import WhisperProcessor
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.transformers.tracing import TraceableWhisperForConditionalGeneration

# --- Args ---
parser = argparse.ArgumentParser()
parser.add_argument('--model_path', type=str, required=True)
parser.add_argument('--quant_path', type=str, required=True)
parser.add_argument('--observer', type=str, default="minmax")
args = parser.parse_args()

# --- Load Model ---
model = TraceableWhisperForConditionalGeneration.from_pretrained(
    args.model_path,
    device_map="auto",
    torch_dtype="auto",
)
model.config.forced_decoder_ids = None
processor = WhisperProcessor.from_pretrained(args.model_path)

# --- Recipe (FP8 Dynamic) ---
# FP8_DYNAMIC quantizes Linear weights statically and activations dynamically,
# so no calibration dataset is needed for this oneshot run.
recipe = [
    QuantizationModifier(
        targets="Linear",
        scheme="FP8_DYNAMIC",
        sequential_targets=["WhisperEncoderLayer", "WhisperDecoderLayer"],
        ignore=["re:.*lm_head"],
    )
]

# --- Run oneshot ---
oneshot(
    model=model,
    recipe=recipe,
    trust_remote_code_model=True,
)

# --- Save ---
os.makedirs(args.quant_path, exist_ok=True)
model.save_pretrained(args.quant_path, save_compressed=True)
processor.save_pretrained(args.quant_path)
```
</details>

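As an optional sanity check (not part of the recipe above), the quantized `model` and `processor` from the creation script can be exercised on a short clip before relying on the saved checkpoint, assuming the in-memory model still supports a standard `generate` call after `oneshot`. A minimal sketch using a small public LibriSpeech dummy split:

```python
# Hypothetical smoke test: run one short LibriSpeech sample through the
# freshly quantized model that is still in memory from the script above.
from datasets import load_dataset

ds = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
sample = ds[0]["audio"]

features = processor(
    sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt"
).input_features

predicted_ids = model.generate(features.to(model.device, dtype=model.dtype))
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```
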
## Evaluation

The model was evaluated on the [LibriSpeech](https://huggingface.co/datasets/lmms-lab/librispeech) and [Fleurs](https://huggingface.co/datasets/lmms-lab/fleurs) datasets using [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval), via the following commands:

<details>
<summary>Evaluation Commands</summary>

LibriSpeech:
```bash
lmms-eval \
    --model=whisper_vllm \
    --model_args="pretrained=neuralmagic-ent/whisper-tiny-FP8-Dynamic" \
    --batch_size 64 \
    --output_path <output_file_path> \
    --tasks librispeech
```

Fleurs:
```bash
lmms-eval \
    --model=whisper_vllm \
    --model_args="pretrained=neuralmagic-ent/whisper-tiny-FP8-Dynamic" \
    --batch_size 64 \
    --output_path <output_file_path> \
    --tasks fleurs
```
</details>

<table>
  <thead>
    <tr>
      <th>Benchmark</th>
      <th>Split</th>
      <th>BF16</th>
      <th>FP8 dynamic (this model)</th>
      <th>Recovery (%)</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td rowspan="2"><b>LibriSpeech (WER)</b></td>
      <td>test-clean</td>
      <td>7.6602</td>
      <td>7.8941</td>
      <td>96.53%</td>
    </tr>
    <tr>
      <td>test-other</td>
      <td>17.1041</td>
      <td>17.1325</td>
      <td>98.74%</td>
    </tr>
    <tr>
      <td rowspan="3"><b>Fleurs (X→en, BLEU)</b></td>
      <td>cmn_hans_cn</td>
      <td></td>
      <td></td>
      <td></td>
    </tr>
    <tr>
      <td>en</td>
      <td></td>
      <td></td>
      <td></td>
    </tr>
    <tr>
      <td>yue_hant_hk</td>
      <td></td>
      <td></td>
      <td></td>
    </tr>
  </tbody>
</table>