---
tags:
- w4a16
- int4
- vllm
- audio
license: apache-2.0
license_link: https://huggingface.co/datasets/choosealicense/licenses/blob/main/markdown/apache-2.0.md
language:
- en
base_model: openai/whisper-large-v3
library_name: transformers
---

# whisper-large-v3-quantized.w4a16

## Model Overview
- **Model Architecture:** whisper-large-v3
  - **Input:** Audio-Text
  - **Output:** Text
- **Model Optimizations:**
  - **Weight quantization:** INT4
  - **Activation quantization:** FP16
- **Release Date:** 1/31/2025
- **Version:** 1.0
- **Model Developers:** Neural Magic

Quantized version of [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3).

### Model Optimizations

This model was obtained by quantizing the weights of [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) to the INT4 data type, ready for inference with vLLM >= 0.5.2.
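
As a quick check (not part of the original card), the quantization scheme recorded in the checkpoint can be inspected from its config. This is a hedged sketch that assumes the repository stores a compressed-tensors style `quantization_config` entry; the exact contents may differ.

```python
# Hedged sketch: print whatever quantization config the checkpoint ships with.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("neuralmagic/whisper-large-v3.w4a16")
print(getattr(config, "quantization_config", "no quantization_config found"))
```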

## Deployment

### Use with vLLM

This model can be deployed efficiently using the [vLLM](https://docs.vllm.ai/en/latest/) backend, as shown in the example below.

```python
from vllm.assets.audio import AudioAsset
from vllm import LLM, SamplingParams

# prepare model
llm = LLM(
    model="neuralmagic/whisper-large-v3.w4a16",
    max_model_len=448,
    max_num_seqs=400,
    limit_mm_per_prompt={"audio": 1},
)

# prepare inputs
inputs = {  # Test explicit encoder/decoder prompt
    "encoder_prompt": {
        "prompt": "",
        "multi_modal_data": {
            "audio": AudioAsset("winning_call").audio_and_sample_rate,
        },
    },
    "decoder_prompt": "<|startoftranscript|>",
}

# generate response
print("========== SAMPLE GENERATION ==============")
outputs = llm.generate(inputs, SamplingParams(temperature=0.0, max_tokens=64))
print(f"PROMPT : {outputs[0].prompt}")
print(f"RESPONSE: {outputs[0].outputs[0].text}")
print("==========================================")
```

vLLM also supports OpenAI-compatible serving. See the [documentation](https://docs.vllm.ai/en/latest/) for more details.
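
As a rough sketch (not from the original card), the server could be started with `vllm serve neuralmagic/whisper-large-v3.w4a16` and queried through the standard `openai` client. Whether the audio transcription route is available for Whisper-style models depends on your vLLM version, and the file name below is a placeholder.

```python
# Hedged sketch: assumes a vLLM OpenAI-compatible server is already running on
# localhost:8000 (e.g. `vllm serve neuralmagic/whisper-large-v3.w4a16`) and that
# this vLLM build exposes the audio transcription endpoint for Whisper models.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# "sample.wav" is a placeholder path to any short audio clip.
with open("sample.wav", "rb") as audio_file:
    transcription = client.audio.transcriptions.create(
        model="neuralmagic/whisper-large-v3.w4a16",
        file=audio_file,
    )

print(transcription.text)
```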

## Creation

This model was created with [llm-compressor](https://github.com/vllm-project/llm-compressor) by running the code snippet below, as part of a multimodal announcement blog.
76
+
77
+ ```python
78
+ import torch
79
+ from datasets import load_dataset
80
+ from transformers import WhisperProcessor
81
+
82
+ from llmcompressor.modifiers.quantization import GPTQModifier
83
+ from llmcompressor.transformers import oneshot
84
+ from llmcompressor.transformers.tracing import TraceableWhisperForConditionalGeneration
85
+
86
+ # Select model and load it.
87
+ MODEL_ID = "openai/whisper-large-v3"
88
+
89
+ model = TraceableWhisperForConditionalGeneration.from_pretrained(
90
+ MODEL_ID,
91
+ device_map="auto",
92
+ torch_dtype="auto",
93
+ )
94
+ model.config.forced_decoder_ids = None
95
+ processor = WhisperProcessor.from_pretrained(MODEL_ID)
96
+
97
+ # Configure processor the dataset task.
98
+ processor.tokenizer.set_prefix_tokens(language="en", task="transcribe")
99
+
100
+ # Select calibration dataset.
101
+ DATASET_ID = "MLCommons/peoples_speech"
102
+ DATASET_SUBSET = "test"
103
+ DATASET_SPLIT = "test"
104
+
105
+ # Select number of samples. 512 samples is a good place to start.
106
+ # Increasing the number of samples can improve accuracy.
107
+ NUM_CALIBRATION_SAMPLES = 512
108
+ MAX_SEQUENCE_LENGTH = 2048
109
+
110
+ # Load dataset and preprocess.
111
+ ds = load_dataset(
112
+ DATASET_ID,
113
+ DATASET_SUBSET,
114
+ split=f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]",
115
+ trust_remote_code=True,
116
+ )
117
+
118
+
119
+ def preprocess(example):
120
+ return {
121
+ "array": example["audio"]["array"],
122
+ "sampling_rate": example["audio"]["sampling_rate"],
123
+ "text": " " + example["text"].capitalize(),
124
+ }
125
+
126
+
127
+ ds = ds.map(preprocess, remove_columns=ds.column_names)
128
+
129
+
130
+ # Process inputs.
131
+ def process(sample):
132
+ inputs = processor(
133
+ audio=sample["array"],
134
+ sampling_rate=sample["sampling_rate"],
135
+ text=sample["text"],
136
+ add_special_tokens=True,
137
+ return_tensors="pt",
138
+ )
139
+
140
+ inputs["input_features"] = inputs["input_features"].to(dtype=model.dtype)
141
+ inputs["decoder_input_ids"] = inputs["labels"]
142
+ del inputs["labels"]
143
+
144
+ return inputs
145
+
146
+
147
+ ds = ds.map(process, remove_columns=ds.column_names)
148
+
149
+
150
+ # Define a oneshot data collator for multimodal inputs.
151
+ def data_collator(batch):
152
+ assert len(batch) == 1
153
+ return {key: torch.tensor(value) for key, value in batch[0].items()}
154
+
155
+
156
+ # Recipe
157
+ recipe = GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"])
158
+
159
+ # Apply algorithms.
160
+ oneshot(
161
+ model=model,
162
+ dataset=ds,
163
+ recipe=recipe,
164
+ max_seq_length=MAX_SEQUENCE_LENGTH,
165
+ num_calibration_samples=NUM_CALIBRATION_SAMPLES,
166
+ data_collator=data_collator,
167
+ )
168
+
169
+ # Confirm generations of the quantized model look sane.
170
+ print("\n\n")
171
+ print("========== SAMPLE GENERATION ==============")
172
+ sample_features = next(iter(ds))["input_features"]
173
+ sample_decoder_ids = [processor.tokenizer.prefix_tokens]
174
+ sample_input = {
175
+ "input_features": torch.tensor(sample_features).to(model.device),
176
+ "decoder_input_ids": torch.tensor(sample_decoder_ids).to(model.device),
177
+ }
178
+
179
+ output = model.generate(**sample_input, language="en")
180
+ print(processor.batch_decode(output, skip_special_tokens=True))
181
+ print("==========================================\n\n")
182
+ # that's where you have a lot of windows in the south no actually that's passive solar
183
+ # and passive solar is something that was developed and designed in the 1960s and 70s
184
+ # and it was a great thing for what it was at the time but it's not a passive house
185
+
186
+ # Save to disk compressed.
187
+ SAVE_DIR = MODEL_ID.split("/")[1] + "-W4A16-G128"
188
+ model.save_pretrained(SAVE_DIR, save_compressed=True)
189
+ processor.save_pretrained(SAVE_DIR)
190
+ ```

## Evaluation

The base model and the quantized W4A16 model were each benchmarked on 511 requests; total test time, latency, throughput, and word error rate (WER) are reported below.

**Base Model**
```
Total Test Time: 94.4606 seconds
Total Requests: 511
Successful Requests: 511
Average Latency: 53.3529 seconds
Median Latency: 52.7258 seconds
95th Percentile Latency: 86.5851 seconds
Estimated req_Throughput: 5.41 requests/s
Estimated Throughput: 100.79 tok/s
WER: 12.660815197787665
```

**W4A16**
```
Total Test Time: 106.2064 seconds
Total Requests: 511
Successful Requests: 511
Average Latency: 59.7467 seconds
Median Latency: 58.3930 seconds
95th Percentile Latency: 97.4831 seconds
Estimated req_Throughput: 4.81 requests/s
Estimated Throughput: 89.35 tok/s
WER: 12.949380786341228
```
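
The exact evaluation harness used to produce these numbers is not included in this card. As a hedged illustration, WER can be computed from collected transcriptions with the `evaluate` library; the prediction and reference strings below are placeholders.

```python
# Hedged sketch: compute word error rate (WER) from transcription results.
# `predictions` and `references` are placeholder lists standing in for the
# model outputs and ground-truth transcripts gathered during evaluation.
import evaluate

predictions = ["that is where you have a lot of windows in the south"]
references = ["that's where you have a lot of windows in the south"]

wer_metric = evaluate.load("wer")
wer = wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer * 100:.2f}")
```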
### BibTeX entry and citation info
```bibtex
@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```