Commit 849b5b0
Parent(s): 095c523
model

Browse files:
- README.md +179 -3
- config.json +25 -0
- conversion_config.json +11 -0
- convert.py +51 -0
- onnx/model.onnx +3 -0
- onnx/model_int8.onnx +3 -0
- onnx/model_uint8.onnx +3 -0
- special_tokens_map.json +37 -0
- test_local.py +49 -0
- test_teradata.py +106 -0
- tokenizer.json +0 -0
- tokenizer_config.json +59 -0
- vocab.txt +0 -0

README.md CHANGED
@@ -1,3 +1,179 @@
---
language:
- multilingual
- ar
- zh
- nl
- en
- fr
- de
- it
- ko
- pl
- pt
- ru
- es
- tr
library_name: sentence-transformers
license: apache-2.0
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
- onnx
- teradata
---
# A Teradata Vantage compatible Embeddings Model

# sentence-transformers/distiluse-base-multilingual-cased-v1

## Overview of this Model

An embedding model that maps text (sentences/paragraphs) into a vector. The [sentence-transformers/distiluse-base-multilingual-cased-v1](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1) model is well known for its effectiveness in capturing semantic meaning in text data. It is a state-of-the-art model trained on a large corpus and capable of generating high-quality text embeddings.

- 134.73M params (sizes in ONNX format - "fp32": 515.61MB, "int8": 129.29MB, "uint8": 129.29MB)
- 512 maximum input tokens
- 512 dimensions of output vector
- License: apache-2.0. The released models can be used for commercial purposes free of charge.
- Reference to Original Model: https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1

## Quickstart: Deploying this Model in Teradata Vantage

We have pre-converted the model into the ONNX format compatible with BYOM 6.0, eliminating the need for manual conversion.

**Note:** Ensure you have access to a Teradata Database with BYOM 6.0 installed.

To get started, download the pre-converted model directly from the Teradata HuggingFace repository.

```python

import teradataml as tdml
import getpass
from huggingface_hub import hf_hub_download

model_name = "distiluse-base-multilingual-cased-v1"
number_dimensions_output = 512
model_file_name = "model.onnx"

# Step 1: Download Model from Teradata HuggingFace Page

hf_hub_download(repo_id=f"Teradata/{model_name}", filename=f"onnx/{model_file_name}", local_dir="./")
hf_hub_download(repo_id=f"Teradata/{model_name}", filename="tokenizer.json", local_dir="./")

# Step 2: Create Connection to Vantage

tdml.create_context(host=input('enter your hostname'),
                    username=input('enter your username'),
                    password=getpass.getpass("enter your password"))

# Step 3: Load Models into Vantage
# a) Embedding model
tdml.save_byom(model_id=model_name,  # must be unique in the models table
               model_file=model_file_name,
               table_name='embeddings_models')
# b) Tokenizer
tdml.save_byom(model_id=model_name,  # must be unique in the models table
               model_file='tokenizer.json',
               table_name='embeddings_tokenizers')

# Step 4: Test ONNXEmbeddings Function
# Note that ONNXEmbeddings expects the 'payload' column to be named 'txt'.
# If it has a different name, just rename it in a subquery/CTE.
input_table = "emails.emails"
embeddings_query = f"""
SELECT
    *
from mldb.ONNXEmbeddings(
        on {input_table} as InputTable
        on (select * from embeddings_models where model_id = '{model_name}') as ModelTable DIMENSION
        on (select model as tokenizer from embeddings_tokenizers where model_id = '{model_name}') as TokenizerTable DIMENSION
        using
            Accumulate('id', 'txt')
            ModelOutputTensor('sentence_embedding')
            EnableMemoryCheck('false')
            OutputFormat('FLOAT32({number_dimensions_output})')
            OverwriteCachedModel('true')
) a
"""
DF_embeddings = tdml.DataFrame.from_query(embeddings_query)
DF_embeddings
```
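
If you want to keep the computed embeddings for downstream analysis, a `teradataml` DataFrame can be materialized as a table. The snippet below is a minimal sketch, not part of the original quickstart; the table name `emails_embeddings` is only a placeholder.

```python
# Optional sketch: persist the embeddings to a permanent table.
# Assumes DF_embeddings from the quickstart above; the table name is hypothetical.
DF_embeddings.to_sql(table_name="emails_embeddings", if_exists="replace")
tdml.DataFrame("emails_embeddings").head()  # quick sanity check
```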


## What Can I Do with the Embeddings?

Teradata Vantage includes pre-built in-database functions to process embeddings further. Explore the following examples:

- **Semantic Clustering with TD_KMeans:** [Semantic Clustering Python Notebook](https://github.com/Teradata/jupyter-demos/blob/main/UseCases/Language_Models_InVantage/Semantic_Clustering_Python.ipynb)
- **Semantic Distance with TD_VectorDistance:** [Semantic Similarity Python Notebook](https://github.com/Teradata/jupyter-demos/blob/main/UseCases/Language_Models_InVantage/Semantic_Similarity_Python.ipynb)
- **RAG-Based Application with TD_VectorDistance:** [RAG and Bedrock Query PDF Notebook](https://github.com/Teradata/jupyter-demos/blob/main/UseCases/Language_Models_InVantage/RAG_and_Bedrock_QueryPDF.ipynb)
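
As a quick illustration of the semantic-distance pattern used in the notebooks above, the sketch below runs `TD_VECTORDISTANCE` over an embeddings table. It assumes a table `emails_embeddings` with an `id` column and embedding columns `emb_0` through `emb_511` (for example, the quickstart output persisted to a table); the table name is a placeholder, not part of this repository.

```python
# Hypothetical example: cosine similarity between one target row (id = 3)
# and all other rows of an embeddings table, using TD_VECTORDISTANCE.
similarity_query = """
SELECT dt.target_id, dt.reference_id, (1.0 - dt.distance) AS similarity
FROM TD_VECTORDISTANCE(
    ON (SELECT * FROM emails_embeddings WHERE id = 3)  AS TargetTable
    ON (SELECT * FROM emails_embeddings WHERE id <> 3) AS ReferenceTable DIMENSION
    USING
        TargetIDColumn('id')
        TargetFeatureColumns('[emb_0:emb_511]')
        RefIDColumn('id')
        RefFeatureColumns('[emb_0:emb_511]')
        DistanceMeasure('cosine')
        topk(3)
) AS dt
"""
tdml.DataFrame.from_query(similarity_query)
```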


## Deep Dive into Model Conversion to ONNX

**The steps below outline how we converted the open-source Hugging Face model into an ONNX file compatible with the in-database ONNXEmbeddings function.**

You do not need to perform these steps yourself; they are provided solely for documentation and transparency. However, they may be helpful if you wish to convert another model to the required format.


### Part 1. Importing and Converting Model using optimum

We start by importing the pre-trained [sentence-transformers/distiluse-base-multilingual-cased-v1](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1) model from Hugging Face.

To enhance performance and ensure compatibility with various execution environments, we use the [Optimum](https://github.com/huggingface/optimum) utility to convert the model into the ONNX (Open Neural Network Exchange) format.

After conversion to ONNX, we fix the opset in the ONNX file for compatibility with the ONNX runtime used in Teradata Vantage.

We generate ONNX files for multiple precisions: fp32, int8, uint8.

You can find the detailed conversion steps in the file [convert.py](./convert.py).
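
For illustration, pinning the opset and IR version of an already-exported ONNX file could look roughly like the sketch below. This is a hedged, minimal example based on the standard `onnx` utilities, not a copy of this repository's conversion code; see [convert.py](./convert.py) for the exact steps used here.

```python
# Minimal sketch: force an exported ONNX model to a target opset and IR version.
# The target values (opset 16, IR 8) come from conversion_config.json.
import onnx
from onnx import version_converter

model = onnx.load("onnx/model.onnx")
converted = version_converter.convert_version(model, 16)  # rewrite ops to opset 16
converted.ir_version = 8                                  # pin the IR version
onnx.save(converted, "onnx/model.onnx")
```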

### Part 2. Running the model in Python with onnxruntime and comparing results

Once the fixes are applied, we test the correctness of the ONNX model by calculating the cosine similarity between two texts with both the native SentenceTransformers model and the ONNX runtime, and comparing the results.

If the results are identical, it confirms that the ONNX model gives the same results as the native model, validating its correctness and suitability for further use in the database.

```python
import onnxruntime as rt

from sentence_transformers.util import cos_sim
from sentence_transformers import SentenceTransformer

import transformers

model_id = "sentence-transformers/distiluse-base-multilingual-cased-v1"

sentences_1 = 'How is the weather today?'
sentences_2 = 'What is the current weather like today?'

# Calculate ONNX result
tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
predef_sess = rt.InferenceSession("onnx/model.onnx")

enc1 = tokenizer(sentences_1)
embeddings_1_onnx = predef_sess.run(None, {"input_ids": [enc1.input_ids],
                                           "attention_mask": [enc1.attention_mask]})

enc2 = tokenizer(sentences_2)
embeddings_2_onnx = predef_sess.run(None, {"input_ids": [enc2.input_ids],
                                           "attention_mask": [enc2.attention_mask]})


# Calculate embeddings with SentenceTransformer
model = SentenceTransformer(model_id, trust_remote_code=True)
embeddings_1_sentence_transformer = model.encode(sentences_1, normalize_embeddings=True)
embeddings_2_sentence_transformer = model.encode(sentences_2, normalize_embeddings=True)

# Compare results
print("Cosine similarity for embeddings calculated with ONNX: " + str(cos_sim(embeddings_1_onnx[1][0], embeddings_2_onnx[1][0])))
print("Cosine similarity for embeddings calculated with SentenceTransformer: " + str(cos_sim(embeddings_1_sentence_transformer, embeddings_2_sentence_transformer)))
```

You can find the detailed ONNX vs. SentenceTransformer result comparison steps in the file [test_local.py](./test_local.py)

config.json ADDED
@@ -0,0 +1,25 @@
{
  "_attn_implementation_autoset": true,
  "_name_or_path": "sentence-transformers/distiluse-base-multilingual-cased-v1",
  "activation": "gelu",
  "architectures": [
    "DistilBertModel"
  ],
  "attention_dropout": 0.1,
  "dim": 768,
  "dropout": 0.1,
  "export_model_type": "transformer",
  "hidden_dim": 3072,
  "initializer_range": 0.02,
  "max_position_embeddings": 512,
  "model_type": "distilbert",
  "n_heads": 12,
  "n_layers": 6,
  "pad_token_id": 0,
  "qa_dropout": 0.1,
  "seq_classif_dropout": 0.2,
  "sinusoidal_pos_embds": false,
  "tie_weights_": true,
  "transformers_version": "4.47.1",
  "vocab_size": 119547
}

conversion_config.json ADDED
@@ -0,0 +1,11 @@
{
  "model_id": "sentence-transformers/distiluse-base-multilingual-cased-v1",
  "number_of_generated_embeddings": 512,
  "precision_to_filename_map": {
    "fp32": "onnx/model.onnx",
    "int8": "onnx/model_int8.onnx",
    "uint8": "onnx/model_uint8.onnx"
  },
  "opset": 16,
  "IR": 8
}

convert.py ADDED
@@ -0,0 +1,51 @@
import os
import json
import shutil

from optimum.exporters.onnx import main_export
import onnx
from onnxconverter_common import float16
import onnxruntime as rt
from onnxruntime.tools.onnx_model_utils import *
from onnxruntime.quantization import quantize_dynamic, QuantType

# Read the conversion parameters (model id, output dimensions, target files, opset, IR).
with open('conversion_config.json') as json_file:
    conversion_config = json.load(json_file)


model_id = conversion_config["model_id"]
number_of_generated_embeddings = conversion_config["number_of_generated_embeddings"]
precision_to_filename_map = conversion_config["precision_to_filename_map"]
opset = conversion_config["opset"]
IR = conversion_config["IR"]


op = onnx.OperatorSetIdProto()
op.version = opset


if not os.path.exists("onnx"):
    os.makedirs("onnx")

print("Exporting the main model version")

# Export the model to ONNX (feature-extraction task) with the requested opset.
main_export(model_name_or_path=model_id, output="./", opset=opset, trust_remote_code=True, task="feature-extraction", dtype="fp32")

if "fp32" in precision_to_filename_map:
    print("Exporting the fp32 onnx file...")

    shutil.copyfile('model.onnx', precision_to_filename_map["fp32"])

    print("Done\n\n")

if "int8" in precision_to_filename_map:
    print("Quantizing fp32 model to int8...")
    quantize_dynamic("model.onnx", precision_to_filename_map["int8"], weight_type=QuantType.QInt8)
    print("Done\n\n")

if "uint8" in precision_to_filename_map:
    print("Quantizing fp32 model to uint8...")
    quantize_dynamic("model.onnx", precision_to_filename_map["uint8"], weight_type=QuantType.QUInt8)
    print("Done\n\n")

os.remove("model.onnx")
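
As a sanity check after running a conversion like the one above, the exported files can be inspected with the `onnx` package. This is a small illustrative snippet, not part of the repository:

```python
# Verify the produced ONNX files: print IR version, opsets, and graph outputs.
import onnx

for path in ["onnx/model.onnx", "onnx/model_int8.onnx", "onnx/model_uint8.onnx"]:
    m = onnx.load(path)
    print(path,
          "ir_version:", m.ir_version,
          "opsets:", [(o.domain, o.version) for o in m.opset_import],
          "outputs:", [o.name for o in m.graph.output])
```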

onnx/model.onnx ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:613abac622b2e5b220e11f54dd41bf6bdff499e19b16e17fcd94291e02c3bdaa
size 540655997

onnx/model_int8.onnx ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:560dc998fbbcfe94bfb9173bc8e57cb8bb2daf1cf40c2b21ae86b6694bfcfeb7
size 135566221

onnx/model_uint8.onnx ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:996a7bdda423a2f2466a11e7dec368f961a130e4481f2d43fecf158c27ffe065
size 135566238

special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
{
  "cls_token": {
    "content": "[CLS]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "mask_token": {
    "content": "[MASK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "pad_token": {
    "content": "[PAD]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "sep_token": {
    "content": "[SEP]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  },
  "unk_token": {
    "content": "[UNK]",
    "lstrip": false,
    "normalized": false,
    "rstrip": false,
    "single_word": false
  }
}

test_local.py ADDED
@@ -0,0 +1,49 @@
import onnxruntime as rt

from sentence_transformers.util import cos_sim
from sentence_transformers import SentenceTransformer

import transformers

import gc
import json


with open('conversion_config.json') as json_file:
    conversion_config = json.load(json_file)


model_id = conversion_config["model_id"]
number_of_generated_embeddings = conversion_config["number_of_generated_embeddings"]
precision_to_filename_map = conversion_config["precision_to_filename_map"]

sentences_1 = 'How is the weather today?'
sentences_2 = 'What is the current weather like today?'

print(f"Testing on cosine similarity between sentences: \n'{sentences_1}'\n'{sentences_2}'\n\n\n")

tokenizer = transformers.AutoTokenizer.from_pretrained("./")
enc1 = tokenizer(sentences_1)
enc2 = tokenizer(sentences_2)

# Run every exported precision through onnxruntime and report the cosine similarity.
for precision, file_name in precision_to_filename_map.items():

    onnx_session = rt.InferenceSession(file_name)
    embeddings_1_onnx = onnx_session.run(None, {"input_ids": [enc1.input_ids],
                                                "attention_mask": [enc1.attention_mask]})[1][0]

    embeddings_2_onnx = onnx_session.run(None, {"input_ids": [enc2.input_ids],
                                                "attention_mask": [enc2.attention_mask]})[1][0]

    del onnx_session
    gc.collect()
    print(f'Cosine similarity for ONNX model with precision "{precision}" is {str(cos_sim(embeddings_1_onnx, embeddings_2_onnx))}')


# Reference value from the original sentence-transformers model.
model = SentenceTransformer(model_id, trust_remote_code=True)
embeddings_1_sentence_transformer = model.encode(sentences_1, normalize_embeddings=True)
embeddings_2_sentence_transformer = model.encode(sentences_2, normalize_embeddings=True)
print('Cosine similarity for original sentence transformer model is ' + str(cos_sim(embeddings_1_sentence_transformer, embeddings_2_sentence_transformer)))

test_teradata.py ADDED
@@ -0,0 +1,106 @@
# Usage: python test_teradata.py <host> <username> <password>
import sys
import teradataml as tdml
from tabulate import tabulate

import json


with open('conversion_config.json') as json_file:
    conversion_config = json.load(json_file)


model_id = conversion_config["model_id"]
number_of_generated_embeddings = conversion_config["number_of_generated_embeddings"]
precision_to_filename_map = conversion_config["precision_to_filename_map"]

host = sys.argv[1]
username = sys.argv[2]
password = sys.argv[3]

print("Setting up connection to teradata...")
tdml.create_context(host=host, username=username, password=password)
print("Done\n\n")


print("Deploying tokenizer...")
try:
    tdml.db_drop_table('tokenizer_table')
except:
    print("Can't drop tokenizers table - it does not exist")
tdml.save_byom('tokenizer',
               'tokenizer.json',
               'tokenizer_table')
print("Done\n\n")

print("Testing models...")
try:
    tdml.db_drop_table('model_table')
except:
    print("Can't drop models table - it does not exist")

for precision, file_name in precision_to_filename_map.items():
    print(f"Deploying {precision} model...")
    tdml.save_byom(precision,
                   file_name,
                   'model_table')
    print(f"Model {precision} is deployed\n")

    print(f"Calculating embeddings with {precision} model...")
    try:
        tdml.db_drop_table('emails_embeddings_store')
    except:
        print("Can't drop embeddings table - it does not exist")

    tdml.execute_sql(f"""
    create volatile table emails_embeddings_store as (
        select
            *
        from mldb.ONNXEmbeddings(
                on emails.emails as InputTable
                on (select * from model_table where model_id = '{precision}') as ModelTable DIMENSION
                on (select model as tokenizer from tokenizer_table where model_id = 'tokenizer') as TokenizerTable DIMENSION

                using
                    Accumulate('id', 'txt')
                    ModelOutputTensor('sentence_embedding')
                    EnableMemoryCheck('false')
                    OutputFormat('FLOAT32({number_of_generated_embeddings})')
                    OverwriteCachedModel('true')
        ) a
    ) with data on commit preserve rows
    """)
    print("Embeddings calculated")
    print(f"Testing semantic search with cosine similarity on the output of the model with precision '{precision}'...")
    tdf_embeddings_store = tdml.DataFrame('emails_embeddings_store')
    tdf_embeddings_store_tgt = tdf_embeddings_store[tdf_embeddings_store.id == 3]

    tdf_embeddings_store_ref = tdf_embeddings_store[tdf_embeddings_store.id != 3]

    cos_sim_pd = tdml.DataFrame.from_query(f"""
    SELECT
        dt.target_id,
        dt.reference_id,
        e_tgt.txt as target_txt,
        e_ref.txt as reference_txt,
        (1.0 - dt.distance) as similarity
    FROM
        TD_VECTORDISTANCE (
            ON ({tdf_embeddings_store_tgt.show_query()}) AS TargetTable
            ON ({tdf_embeddings_store_ref.show_query()}) AS ReferenceTable DIMENSION
            USING
                TargetIDColumn('id')
                TargetFeatureColumns('[emb_0:emb_{number_of_generated_embeddings - 1}]')
                RefIDColumn('id')
                RefFeatureColumns('[emb_0:emb_{number_of_generated_embeddings - 1}]')
                DistanceMeasure('cosine')
                topk(3)
        ) AS dt
    JOIN emails.emails e_tgt on e_tgt.id = dt.target_id
    JOIN emails.emails e_ref on e_ref.id = dt.reference_id;
    """).to_pandas()
    print(tabulate(cos_sim_pd, headers='keys', tablefmt='fancy_grid'))
    print("Done\n\n")


tdml.remove_context()

tokenizer.json ADDED
The diff for this file is too large to render. See raw diff.

tokenizer_config.json ADDED
@@ -0,0 +1,59 @@
{
  "added_tokens_decoder": {
    "0": {
      "content": "[PAD]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "100": {
      "content": "[UNK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "101": {
      "content": "[CLS]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "102": {
      "content": "[SEP]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    },
    "103": {
      "content": "[MASK]",
      "lstrip": false,
      "normalized": false,
      "rstrip": false,
      "single_word": false,
      "special": true
    }
  },
  "clean_up_tokenization_spaces": false,
  "cls_token": "[CLS]",
  "do_basic_tokenize": true,
  "do_lower_case": false,
  "extra_special_tokens": {},
  "mask_token": "[MASK]",
  "max_len": 512,
  "model_max_length": 512,
  "never_split": null,
  "pad_token": "[PAD]",
  "sep_token": "[SEP]",
  "strip_accents": null,
  "tokenize_chinese_chars": true,
  "tokenizer_class": "DistilBertTokenizer",
  "unk_token": "[UNK]"
}

vocab.txt ADDED
The diff for this file is too large to render. See raw diff.