This repo measures the performance delta between dense quantized MPT-7B and 70% sparse-quantized MPT-7B on OpenVINO. Both models are quantized to 8-bit weights and activations (W8A8). The benchmark metric is decoding (next-token) latency at a context length of 512.

Target HW: Intel 4th gen Xeon (Sapphire Rapids)

SW

git clone https://huggingface.co/vuiseng9/ov-mpt-7b-gsm8k-sparse70
pip install openvino==2024.2.0
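
Optionally, a quick sanity check (a minimal Python sketch, assuming the openvino package pinned above) confirms the runtime version and that the CPU device is visible before benchmarking:

import openvino as ov

print(ov.get_version())             # expect a 2024.2.x build per the pin above
print(ov.Core().available_devices)  # should include "CPU"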

Benchmarking with OpenVINO

  1. ./benchmarkapp_w8a8.bash
  2. ./benchmarkapp_w8a8_sparse70.bash

Note: remove the numactl prefix from the scripts if your node does not support it. A Python sketch of the equivalent measurement follows.
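
For orientation, below is a minimal Python sketch of what the scripts measure; it is not a substitute for them. The IR filename (openvino_model.xml) is an assumption about the cloned repo's layout, and the CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE property with a 0.7 threshold is an assumption about how the sparse run enables decompression. benchmark_app itself handles input generation and the timing loop.

import openvino as ov

# Minimal sketch, not the repo's scripts. The model path and the 0.7
# decompression threshold are assumptions for illustration.
core = ov.Core()
model = core.read_model("ov-mpt-7b-gsm8k-sparse70/openvino_model.xml")
compiled = core.compile_model(
    model,
    "CPU",
    {
        "PERFORMANCE_HINT": "LATENCY",
        # For the sparse model: decompress FC weights that are >= 70% zero.
        "CPU_SPARSE_WEIGHTS_DECOMPRESSION_RATE": 0.7,
    },
)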

Implementation of Sparse Weight Decompression in OpenVINO

  • Sparse Weight Decompression first landed in OpenVINO’s fork of oneDNN in this PR: https://github.com/openvinotoolkit/oneDNN/pull/158/files

  • You can browse the changed files via the left pane of the PR.

  • initialization: src/cpu/reorder/simple_sparse_reorder.hpp (line 113)

  • decompression: src/cpu/x64/jit_brgemm_decompress_kernel.cpp (line 41); a conceptual sketch of the scheme follows this list.

  • If you'd like to build the OpenVINO runtime from source for debugging, see the wiki page; benchmark_app is built as part of the runtime.
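
As a conceptual illustration only, the scheme can be mimicked in NumPy: the reorder packs the nonzero int8 values alongside a one-bit-per-weight mask, and the decompress step scatters the values back into a dense tile just before the GEMM consumes it. The helper names below are made up for the sketch; the real kernel is JIT-generated x64 code.

import numpy as np

# Conceptual NumPy mimic of sparse weight decompression; function names
# are hypothetical and do not correspond to the oneDNN sources above.

def compress(dense):
    mask = dense != 0           # stored as 1 bit per weight in practice
    return dense[mask], mask    # packed nonzero int8 values + bitmask

def decompress(values, mask):
    dense = np.zeros(mask.shape, dtype=values.dtype)
    dense[mask] = values        # scatter nonzeros back into a dense tile
    return dense

w8 = np.array([[0, 3, 0, 0],
               [-2, 0, 0, 1]], dtype=np.int8)   # a small sparse int8 tile
vals, mask = compress(w8)
assert np.array_equal(decompress(vals, mask), w8)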

Related materials:

OpenVINO blog on Sparse-Quantized BERT (and the corresponding notebook)
