---
license: mit
---

This model is the ONNX version of [https://huggingface.co/SamLowe/roberta-base-go_emotions](https://huggingface.co/SamLowe/roberta-base-go_emotions).

### Full precision ONNX version

`onnx/model.onnx` is the full precision ONNX version

- that has identical accuracy to the original Transformers model
- and has the same model size (499MB)
- is faster at inference than normal Transformers, particularly for smaller batch sizes
- in my tests about 2x to 3x as fast for a batch size of 1 on an 8 core 11th gen i7 CPU using ONNXRuntime (a rough way to measure this yourself is sketched below)

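Such timings are machine-dependent. As a rough sketch for checking them on your own hardware (the `mean_latency_ms` helper and the `predict` callable are illustrative, not part of this repo), latency per batch can be measured like this:

```python
import time

# Illustrative helper: `predict` is any callable taking a list of sentences,
# e.g. a transformers pipeline or the ONNXRuntime snippet below wrapped in a function.
def mean_latency_ms(predict, sentences, n_runs=100):
    predict(sentences)  # warm-up call, excluded from the timing
    start = time.perf_counter()
    for _ in range(n_runs):
        predict(sentences)
    return (time.perf_counter() - start) / n_runs * 1000.0

# e.g. mean_latency_ms(my_predict_fn, ["hello world"]) for a batch size of 1
```
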
### Quantized (INT8) ONNX version

`onnx/model_quantized.onnx` is the int8 quantized version

- that is one quarter the size (125MB) of the full precision model (above)
- but delivers almost all of the accuracy
- is faster at inference
- about 2x as fast for a batch size of 1 on an 8 core 11th gen i7 CPU using ONNXRuntime vs the full precision model above
- which makes it circa 5x as fast as the full precision normal Transformers model (on the above mentioned CPU, for a batch of 1)

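For reference, an INT8 file like this can be produced from a full precision ONNX model with Optimum's `ORTQuantizer`. The sketch below shows that approach in general; the quantization config shown is an assumption, not a record of how `onnx/model_quantized.onnx` was actually created:

```python
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# load the full precision ONNX model from a local directory
quantizer = ORTQuantizer.from_pretrained("local_dir_with_onnx_model", file_name="model.onnx")

# dynamic int8 quantization; avx512_vnni is an assumed choice - pick the config for your CPU
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

# writes model_quantized.onnx into save_dir
quantizer.quantize(save_dir="local_output_dir", quantization_config=qconfig)
```
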
### How to use

#### Using Optimum Library ONNX Classes

To follow.
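
In the meantime, below is a minimal sketch using Optimum's `ORTModelForSequenceClassification` with a `transformers` pipeline. The repo id shown is the original model's, and the `subfolder`/`file_name` arguments assume the layout described above, so adjust them to this repository as needed:

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

repo_id = "SamLowe/roberta-base-go_emotions"  # assumption - substitute this ONNX repo's id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = ORTModelForSequenceClassification.from_pretrained(
    repo_id, subfolder="onnx", file_name="model_quantized.onnx"  # or "model.onnx"
)

# the pipeline tokenizes for you; sigmoid is used as go_emotions is multi-label
classifier = pipeline(
    "text-classification", model=model, tokenizer=tokenizer,
    function_to_apply="sigmoid", top_k=None,  # top_k=None returns scores for all labels
)
print(classifier(["hello world"]))
```
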
#### Using ONNXRuntime

- Tokenization can be done beforehand with the `tokenizers` library,
- the token ids and attention mask are then fed into ONNXRuntime as the dict of inputs it expects,
- and afterwards only a sigmoid postprocessing step on the model output (which comes back as a numpy array) is needed to produce the per-label probabilities.

```python
from tokenizers import Tokenizer
import onnxruntime as ort

from os import cpu_count
import numpy as np  # used to build the int64 input arrays and for the postprocessing sigmoid

sentences = ["hello world"]  # for example a batch of 1

tokenizer = Tokenizer.from_pretrained("SamLowe/roberta-base-go_emotions")

# optional - set pad to only pad to longest in batch, not a fixed length.
# Without this, the model will run slower, esp for shorter input strings.
params = {**tokenizer.padding, "length": None}
tokenizer.enable_padding(**params)

tokens_obj = tokenizer.encode_batch(sentences)

def load_onnx_model(model_filepath):
    _options = ort.SessionOptions()
    _options.inter_op_num_threads, _options.intra_op_num_threads = cpu_count(), cpu_count()
    _providers = ["CPUExecutionProvider"]  # could use ort.get_available_providers()
    return ort.InferenceSession(path_or_bytes=model_filepath, sess_options=_options, providers=_providers)

model = load_onnx_model("path_to_model_dot_onnx_or_model_quantized_dot_onnx")
output_names = [model.get_outputs()[0].name]  # e.g. ["logits"]

# ONNXRuntime expects numpy arrays (int64 for this model), not plain Python lists
model_input = {
    "input_ids": np.array([t.ids for t in tokens_obj], dtype=np.int64),
    "attention_mask": np.array([t.attention_mask for t in tokens_obj], dtype=np.int64),
}

def sigmoid(_outputs):
    return 1.0 / (1.0 + np.exp(-_outputs))

model_output = model.run(
    output_names=output_names,
    input_feed=model_input,
)[0]

probabilities = sigmoid(model_output)  # per-label probabilities, as this is a multi-label model
print(probabilities)
```
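
The output is a plain array of per-label probabilities. To attach the emotion names, one option (an add-on sketch, assuming `transformers` is installed) is to read `id2label` from the original model's config:

```python
from transformers import AutoConfig

# label names come from the original model's config
config = AutoConfig.from_pretrained("SamLowe/roberta-base-go_emotions")
labels = [config.id2label[i] for i in range(len(config.id2label))]

# reuses `sentences` and `probabilities` from the snippet above
for sentence, probs in zip(sentences, probabilities):
    top = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)[:3]
    print(sentence, top)  # three highest-scoring emotions
```
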
### Example notebook: showing usage, accuracy & performance

Notebook with more details to follow.