---
license: mit
---

This model is the ONNX version of [https://huggingface.co/SamLowe/roberta-base-go_emotions](https://huggingface.co/SamLowe/roberta-base-go_emotions).

### Full precision ONNX version

`onnx/model.onnx` is the full precision ONNX version

- that has identical accuracy to the original Transformers model
- and has the same model size (499MB)
- is faster at inference than normal Transformers, particularly for smaller batch sizes
- in my tests about 2x to 3x as fast for a batch size of 1 on an 8 core 11th gen i7 CPU using ONNXRuntime (a rough way to measure this yourself is sketched below)

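Such timings are machine-dependent. As a rough sketch for checking them on your own hardware (the `mean_latency_ms` helper and the `predict` callable are illustrative, not part of this repo), latency per batch can be measured like this:

```python
import time

# Illustrative helper: `predict` is any callable taking a list of sentences,
# e.g. a transformers pipeline or the ONNXRuntime snippet below wrapped in a function.
def mean_latency_ms(predict, sentences, n_runs=100):
    predict(sentences)  # warm-up call, excluded from the timing
    start = time.perf_counter()
    for _ in range(n_runs):
        predict(sentences)
    return (time.perf_counter() - start) / n_runs * 1000.0

# e.g. mean_latency_ms(my_predict_fn, ["hello world"]) for a batch size of 1
```
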
### Quantized (INT8) ONNX version

`onnx/model_quantized.onnx` is the int8 quantized version

- that is one quarter the size (125MB) of the full precision model (above)
- but delivers almost all of the accuracy
- is faster at inference
- about 2x as fast for a batch size of 1 on an 8 core 11th gen i7 CPU using ONNXRuntime vs the full precision model above
- which makes it circa 5x as fast as the full precision normal Transformers model (on the above mentioned CPU, for a batch of 1)

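For reference, an INT8 file like this can be produced from a full precision ONNX model with Optimum's `ORTQuantizer`. The sketch below shows that approach in general; the quantization config shown is an assumption, not a record of how `onnx/model_quantized.onnx` was actually created:

```python
from optimum.onnxruntime import ORTQuantizer
from optimum.onnxruntime.configuration import AutoQuantizationConfig

# load the full precision ONNX model from a local directory
quantizer = ORTQuantizer.from_pretrained("local_dir_with_onnx_model", file_name="model.onnx")

# dynamic int8 quantization; avx512_vnni is an assumed choice - pick the config for your CPU
qconfig = AutoQuantizationConfig.avx512_vnni(is_static=False, per_channel=False)

# writes model_quantized.onnx into save_dir
quantizer.quantize(save_dir="local_output_dir", quantization_config=qconfig)
```
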
### How to use

#### Using Optimum Library ONNX Classes

To follow.
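
In the meantime, below is a minimal sketch using Optimum's `ORTModelForSequenceClassification` with a `transformers` pipeline. The repo id shown is the original model's, and the `subfolder`/`file_name` arguments assume the layout described above, so adjust them to this repository as needed:

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer, pipeline

repo_id = "SamLowe/roberta-base-go_emotions"  # assumption - substitute this ONNX repo's id

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = ORTModelForSequenceClassification.from_pretrained(
    repo_id, subfolder="onnx", file_name="model_quantized.onnx"  # or "model.onnx"
)

# the pipeline tokenizes for you; sigmoid is used as go_emotions is multi-label
classifier = pipeline(
    "text-classification", model=model, tokenizer=tokenizer,
    function_to_apply="sigmoid", top_k=None,  # top_k=None returns scores for all labels
)
print(classifier(["hello world"]))
```
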
#### Using ONNXRuntime

- Tokenization can be done beforehand with the `tokenizers` library,
- the token ids and attention mask are then fed into ONNXRuntime as the dict of inputs it expects,
- and afterwards only a sigmoid postprocessing step on the model output (which comes back as a numpy array) is needed to produce the per-label probabilities.

```python
from tokenizers import Tokenizer
import onnxruntime as ort

from os import cpu_count
import numpy as np  # used to build the int64 input arrays and for the postprocessing sigmoid

sentences = ["hello world"]  # for example a batch of 1

tokenizer = Tokenizer.from_pretrained("SamLowe/roberta-base-go_emotions")

# optional - set pad to only pad to longest in batch, not a fixed length.
# Without this, the model will run slower, esp for shorter input strings.
params = {**tokenizer.padding, "length": None}
tokenizer.enable_padding(**params)

tokens_obj = tokenizer.encode_batch(sentences)

def load_onnx_model(model_filepath):
    _options = ort.SessionOptions()
    _options.inter_op_num_threads, _options.intra_op_num_threads = cpu_count(), cpu_count()
    _providers = ["CPUExecutionProvider"]  # could use ort.get_available_providers()
    return ort.InferenceSession(path_or_bytes=model_filepath, sess_options=_options, providers=_providers)

model = load_onnx_model("path_to_model_dot_onnx_or_model_quantized_dot_onnx")
output_names = [model.get_outputs()[0].name]  # e.g. ["logits"]

# ONNXRuntime expects numpy arrays (int64 for this model), not plain Python lists
model_input = {
    "input_ids": np.array([t.ids for t in tokens_obj], dtype=np.int64),
    "attention_mask": np.array([t.attention_mask for t in tokens_obj], dtype=np.int64),
}

def sigmoid(_outputs):
    return 1.0 / (1.0 + np.exp(-_outputs))

model_output = model.run(
    output_names=output_names,
    input_feed=model_input,
)[0]

probabilities = sigmoid(model_output)  # per-label probabilities, as this is a multi-label model
print(probabilities)
```
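
The output is a plain array of per-label probabilities. To attach the emotion names, one option (an add-on sketch, assuming `transformers` is installed) is to read `id2label` from the original model's config:

```python
from transformers import AutoConfig

# label names come from the original model's config
config = AutoConfig.from_pretrained("SamLowe/roberta-base-go_emotions")
labels = [config.id2label[i] for i in range(len(config.id2label))]

# reuses `sentences` and `probabilities` from the snippet above
for sentence, probs in zip(sentences, probabilities):
    top = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)[:3]
    print(sentence, top)  # three highest-scoring emotions
```
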
### Example notebook: showing usage, accuracy & performance

Notebook with more details to follow.