martinhillebrandtd committed
Commit 849b5b0 · 1 Parent(s): 095c523
README.md CHANGED
@@ -1,3 +1,179 @@
- ---
- license: apache-2.0
- ---
+ ---
+ language:
+ - multilingual
+ - ar
+ - zh
+ - nl
+ - en
+ - fr
+ - de
+ - it
+ - ko
+ - pl
+ - pt
+ - ru
+ - es
+ - tr
+ library_name: sentence-transformers
+ license: apache-2.0
+ pipeline_tag: sentence-similarity
+ tags:
+ - sentence-transformers
+ - feature-extraction
+ - sentence-similarity
+ - onnx
+ - teradata
+
+ ---
+ # A Teradata Vantage compatible Embeddings Model
+
+ # sentence-transformers/distiluse-base-multilingual-cased-v1
+
+ ## Overview of this Model
+
+ An embedding model that maps text (sentences/paragraphs) into a vector. The [sentence-transformers/distiluse-base-multilingual-cased-v1](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1) model is well known for its effectiveness in capturing semantic meaning in text data. It is a state-of-the-art model trained on a large corpus, capable of generating high-quality text embeddings.
+
+ - 134.73M params (Sizes in ONNX format - "fp32": 515.61MB, "int8": 129.29MB, "uint8": 129.29MB)
+ - 512 maximum input tokens
+ - 512 dimensions of output vector (a quick local check is sketched below this list)
+ - License: apache-2.0. The released models can be used for commercial purposes free of charge.
+ - Reference to Original Model: https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1
+
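+ As a quick, optional local check (a minimal sketch, assuming the `sentence-transformers` package is installed; it is not part of the Vantage deployment flow), the original model can be loaded to confirm the 512-dimensional output:
+
+ ```python
+ # Minimal local sanity check of the embedding dimensionality (example only).
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer("sentence-transformers/distiluse-base-multilingual-cased-v1")
+ embedding = model.encode("How is the weather today?")
+ print(embedding.shape)  # expected: (512,)
+ ```
+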
+
+ ## Quickstart: Deploying this Model in Teradata Vantage
+
+ We have pre-converted the model into the ONNX format compatible with BYOM 6.0, eliminating the need for manual conversion.
+
+ **Note:** Ensure you have access to a Teradata Database with BYOM 6.0 installed.
+
+ To get started, download the pre-converted model directly from the Teradata HuggingFace repository.
+
+
+ ```python
+
+ import teradataml as tdml
+ import getpass
+ from huggingface_hub import hf_hub_download
+
+ model_name = "distiluse-base-multilingual-cased-v1"
+ number_dimensions_output = 512
+ model_file_name = "model.onnx"
+
+ # Step 1: Download Model from Teradata HuggingFace Page
+
+ hf_hub_download(repo_id=f"Teradata/{model_name}", filename=f"onnx/{model_file_name}", local_dir="./")
+ hf_hub_download(repo_id=f"Teradata/{model_name}", filename="tokenizer.json", local_dir="./")
+
+ # Step 2: Create Connection to Vantage
+
+ tdml.create_context(host=input('enter your hostname'),
+                     username=input('enter your username'),
+                     password=getpass.getpass("enter your password"))
+
+ # Step 3: Load Models into Vantage
+ # a) Embedding model
+ tdml.save_byom(model_id=model_name,  # must be unique in the models table
+                model_file=model_file_name,
+                table_name='embeddings_models')
+ # b) Tokenizer
+ tdml.save_byom(model_id=model_name,  # must be unique in the models table
+                model_file='tokenizer.json',
+                table_name='embeddings_tokenizers')
+
+ # Step 4: Test ONNXEmbeddings Function
+ # Note that ONNXEmbeddings expects the payload column to be named 'txt'.
+ # If your column has a different name, rename it in a subquery/CTE.
+ input_table = "emails.emails"
+ embeddings_query = f"""
+ SELECT
+     *
+ from mldb.ONNXEmbeddings(
+         on {input_table} as InputTable
+         on (select * from embeddings_models where model_id = '{model_name}') as ModelTable DIMENSION
+         on (select model as tokenizer from embeddings_tokenizers where model_id = '{model_name}') as TokenizerTable DIMENSION
+     using
+         Accumulate('id', 'txt')
+         ModelOutputTensor('sentence_embedding')
+         EnableMemoryCheck('false')
+         OutputFormat('FLOAT32({number_dimensions_output})')
+         OverwriteCachedModel('true')
+ ) a
+ """
+ DF_embeddings = tdml.DataFrame.from_query(embeddings_query)
+ DF_embeddings
+ ```
+
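+ As an optional follow-up (a sketch only, mirroring the volatile-table pattern used in [test_teradata.py](./test_teradata.py) further below; the table name `emails_embeddings_store` is purely illustrative), the same `embeddings_query` can be wrapped in a CREATE TABLE statement so the embeddings are materialized once instead of being recomputed on every read:
+
+ ```python
+ # Persist the embeddings into a table so downstream functions
+ # (e.g. TD_VectorDistance) can reuse them; reuses the tdml connection
+ # and the embeddings_query string from the snippet above.
+ tdml.execute_sql(f"""
+ create table emails_embeddings_store as (
+     {embeddings_query}
+ ) with data
+ """)
+ ```
+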
+
+
+ ## What Can I Do with the Embeddings?
+
+ Teradata Vantage includes pre-built in-database functions to process embeddings further. Explore the following examples; a minimal TD_VectorDistance sketch follows the list:
+
+ - **Semantic Clustering with TD_KMeans:** [Semantic Clustering Python Notebook](https://github.com/Teradata/jupyter-demos/blob/main/UseCases/Language_Models_InVantage/Semantic_Clustering_Python.ipynb)
+ - **Semantic Distance with TD_VectorDistance:** [Semantic Similarity Python Notebook](https://github.com/Teradata/jupyter-demos/blob/main/UseCases/Language_Models_InVantage/Semantic_Similarity_Python.ipynb)
+ - **RAG-Based Application with TD_VectorDistance:** [RAG and Bedrock Query PDF Notebook](https://github.com/Teradata/jupyter-demos/blob/main/UseCases/Language_Models_InVantage/RAG_and_Bedrock_QueryPDF.ipynb)
+
+
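+ For illustration, here is a hedged sketch of in-database semantic search with `TD_VECTORDISTANCE`, closely following the query used in [test_teradata.py](./test_teradata.py); it assumes embeddings were persisted to the illustrative `emails_embeddings_store` table from the sketch above:
+
+ ```python
+ # Rank the 3 nearest reference rows for the target row with id = 3,
+ # using cosine distance over the 512 embedding columns emb_0 .. emb_511.
+ similarity_query = """
+ SELECT dt.target_id, dt.reference_id, (1.0 - dt.distance) AS similarity
+ FROM TD_VECTORDISTANCE(
+     ON (SELECT * FROM emails_embeddings_store WHERE id = 3) AS TargetTable
+     ON (SELECT * FROM emails_embeddings_store WHERE id <> 3) AS ReferenceTable DIMENSION
+     USING
+         TargetIDColumn('id')
+         TargetFeatureColumns('[emb_0:emb_511]')
+         RefIDColumn('id')
+         RefFeatureColumns('[emb_0:emb_511]')
+         DistanceMeasure('cosine')
+         topk(3)
+ ) AS dt
+ """
+ tdml.DataFrame.from_query(similarity_query)
+ ```
+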
+ ## Deep Dive into Model Conversion to ONNX
+
+ **The steps below outline how we converted the open-source Hugging Face model into an ONNX file compatible with the in-database ONNXEmbeddings function.**
+
+ You do not need to perform these steps—they are provided solely for documentation and transparency. However, they may be helpful if you wish to convert another model to the required format.
+
+
+ ### Part 1. Importing and Converting Model using Optimum
+
+ We start by importing the pre-trained [sentence-transformers/distiluse-base-multilingual-cased-v1](https://huggingface.co/sentence-transformers/distiluse-base-multilingual-cased-v1) model from Hugging Face.
+
+ To enhance performance and ensure compatibility with various execution environments, we'll use the [Optimum](https://github.com/huggingface/optimum) utility to convert the model into the ONNX (Open Neural Network Exchange) format.
+
+ After conversion to ONNX, we fix the opset in the ONNX file for compatibility with the ONNX runtime used in Teradata Vantage (a short sketch of this step is shown below).
+
+ We generate ONNX files for multiple precisions: fp32, int8, and uint8.
+
+ You can find the detailed conversion steps in the file [convert.py](./convert.py).
+
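+ For illustration only (the exact code used lives in [convert.py](./convert.py)), pinning the opset and IR version recorded in [conversion_config.json](./conversion_config.json) can be done roughly like this with the generic `onnx` utilities; treat it as a sketch, not the authoritative conversion step:
+
+ ```python
+ # Hedged example: pin an exported model to opset 16 / IR version 8,
+ # the values recorded in conversion_config.json for this repository.
+ import onnx
+ from onnx import version_converter
+
+ model = onnx.load("onnx/model.onnx")
+ converted = version_converter.convert_version(model, 16)  # target opset
+ converted.ir_version = 8                                  # target IR version
+ onnx.save(converted, "onnx/model.onnx")
+ ```
+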
+ ### Part 2. Running the model in Python with onnxruntime & comparing results
+
+ Once the fixes are applied, we test the correctness of the ONNX model by calculating the cosine similarity between two texts with both the native SentenceTransformers model and the ONNX runtime, and comparing the results.
+
+ If the results are identical, it confirms that the ONNX model gives the same result as the native model, validating its correctness and suitability for further use in the database.
+
+
+ ```python
+ import onnxruntime as rt
+
+ from sentence_transformers.util import cos_sim
+ from sentence_transformers import SentenceTransformer
+
+ import transformers
+
+ model_id = "sentence-transformers/distiluse-base-multilingual-cased-v1"
+
+ sentences_1 = 'How is the weather today?'
+ sentences_2 = 'What is the current weather like today?'
+
+ # Calculate ONNX result
+ tokenizer = transformers.AutoTokenizer.from_pretrained(model_id)
+ predef_sess = rt.InferenceSession("onnx/model.onnx")
+
+ enc1 = tokenizer(sentences_1)
+ embeddings_1_onnx = predef_sess.run(None, {"input_ids": [enc1.input_ids],
+                                            "attention_mask": [enc1.attention_mask]})
+
+ enc2 = tokenizer(sentences_2)
+ embeddings_2_onnx = predef_sess.run(None, {"input_ids": [enc2.input_ids],
+                                            "attention_mask": [enc2.attention_mask]})
+
+
+ # Calculate embeddings with SentenceTransformer
+ model = SentenceTransformer(model_id, trust_remote_code=True)
+ embeddings_1_sentence_transformer = model.encode(sentences_1, normalize_embeddings=True)
+ embeddings_2_sentence_transformer = model.encode(sentences_2, normalize_embeddings=True)
+
+ # Compare results
+ print("Cosine similarity for embeddings calculated with ONNX: " + str(cos_sim(embeddings_1_onnx[1][0], embeddings_2_onnx[1][0])))
+ print("Cosine similarity for embeddings calculated with SentenceTransformer: " + str(cos_sim(embeddings_1_sentence_transformer, embeddings_2_sentence_transformer)))
+ ```
+
+ You can find the detailed ONNX vs. SentenceTransformer result comparison steps in the file [test_local.py](./test_local.py).
+
config.json ADDED
@@ -0,0 +1,25 @@
+ {
+   "_attn_implementation_autoset": true,
+   "_name_or_path": "sentence-transformers/distiluse-base-multilingual-cased-v1",
+   "activation": "gelu",
+   "architectures": [
+     "DistilBertModel"
+   ],
+   "attention_dropout": 0.1,
+   "dim": 768,
+   "dropout": 0.1,
+   "export_model_type": "transformer",
+   "hidden_dim": 3072,
+   "initializer_range": 0.02,
+   "max_position_embeddings": 512,
+   "model_type": "distilbert",
+   "n_heads": 12,
+   "n_layers": 6,
+   "pad_token_id": 0,
+   "qa_dropout": 0.1,
+   "seq_classif_dropout": 0.2,
+   "sinusoidal_pos_embds": false,
+   "tie_weights_": true,
+   "transformers_version": "4.47.1",
+   "vocab_size": 119547
+ }
conversion_config.json ADDED
@@ -0,0 +1,11 @@
+ {
+   "model_id": "sentence-transformers/distiluse-base-multilingual-cased-v1",
+   "number_of_generated_embeddings": 512,
+   "precision_to_filename_map": {
+     "fp32": "onnx/model.onnx",
+     "int8": "onnx/model_int8.onnx",
+     "uint8": "onnx/model_uint8.onnx"
+   },
+   "opset": 16,
+   "IR": 8
+ }
convert.py ADDED
@@ -0,0 +1,51 @@
+ import os
+ import json
+ import shutil
+
+ from optimum.exporters.onnx import main_export
+ import onnx
+ from onnxconverter_common import float16
+ import onnxruntime as rt
+ from onnxruntime.tools.onnx_model_utils import *
+ from onnxruntime.quantization import quantize_dynamic, QuantType
+
+ with open('conversion_config.json') as json_file:
+     conversion_config = json.load(json_file)
+
+
+ model_id = conversion_config["model_id"]
+ number_of_generated_embeddings = conversion_config["number_of_generated_embeddings"]
+ precision_to_filename_map = conversion_config["precision_to_filename_map"]
+ opset = conversion_config["opset"]
+ IR = conversion_config["IR"]
+
+
+ op = onnx.OperatorSetIdProto()
+ op.version = opset
+
+
+ if not os.path.exists("onnx"):
+     os.makedirs("onnx")
+
+ print("Exporting the main model version")
+
+ main_export(model_name_or_path=model_id, output="./", opset=opset, trust_remote_code=True, task="feature-extraction", dtype="fp32")
+
+ if "fp32" in precision_to_filename_map:
+     print("Exporting the fp32 onnx file...")
+
+     shutil.copyfile('model.onnx', precision_to_filename_map["fp32"])
+
+     print("Done\n\n")
+
+ if "int8" in precision_to_filename_map:
+     print("Quantizing fp32 model to int8...")
+     quantize_dynamic("model.onnx", precision_to_filename_map["int8"], weight_type=QuantType.QInt8)
+     print("Done\n\n")
+
+ if "uint8" in precision_to_filename_map:
+     print("Quantizing fp32 model to uint8...")
+     quantize_dynamic("model.onnx", precision_to_filename_map["uint8"], weight_type=QuantType.QUInt8)
+     print("Done\n\n")
+
+ os.remove("model.onnx")
onnx/model.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:613abac622b2e5b220e11f54dd41bf6bdff499e19b16e17fcd94291e02c3bdaa
+ size 540655997
onnx/model_int8.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:560dc998fbbcfe94bfb9173bc8e57cb8bb2daf1cf40c2b21ae86b6694bfcfeb7
+ size 135566221
onnx/model_uint8.onnx ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:996a7bdda423a2f2466a11e7dec368f961a130e4481f2d43fecf158c27ffe065
+ size 135566238
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+   "cls_token": {
+     "content": "[CLS]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "mask_token": {
+     "content": "[MASK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "pad_token": {
+     "content": "[PAD]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "sep_token": {
+     "content": "[SEP]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   },
+   "unk_token": {
+     "content": "[UNK]",
+     "lstrip": false,
+     "normalized": false,
+     "rstrip": false,
+     "single_word": false
+   }
+ }
test_local.py ADDED
@@ -0,0 +1,49 @@
+ import onnxruntime as rt
+
+ from sentence_transformers.util import cos_sim
+ from sentence_transformers import SentenceTransformer
+
+ import transformers
+
+ import gc
+ import json
+
+
+ with open('conversion_config.json') as json_file:
+     conversion_config = json.load(json_file)
+
+
+ model_id = conversion_config["model_id"]
+ number_of_generated_embeddings = conversion_config["number_of_generated_embeddings"]
+ precision_to_filename_map = conversion_config["precision_to_filename_map"]
+
+ sentences_1 = 'How is the weather today?'
+ sentences_2 = 'What is the current weather like today?'
+
+ print(f"Testing on cosine similarity between sentences: \n'{sentences_1}'\n'{sentences_2}'\n\n\n")
+
+ tokenizer = transformers.AutoTokenizer.from_pretrained("./")
+ enc1 = tokenizer(sentences_1)
+ enc2 = tokenizer(sentences_2)
+
+ for precision, file_name in precision_to_filename_map.items():
+
+
+     onnx_session = rt.InferenceSession(file_name)
+     embeddings_1_onnx = onnx_session.run(None, {"input_ids": [enc1.input_ids],
+                                                 "attention_mask": [enc1.attention_mask]})[1][0]
+
+     embeddings_2_onnx = onnx_session.run(None, {"input_ids": [enc2.input_ids],
+                                                 "attention_mask": [enc2.attention_mask]})[1][0]
+
+     del onnx_session
+     gc.collect()
+     print(f'Cosine similarity for ONNX model with precision "{precision}" is {str(cos_sim(embeddings_1_onnx, embeddings_2_onnx))}')
+
+
+
+
+ model = SentenceTransformer(model_id, trust_remote_code=True)
+ embeddings_1_sentence_transformer = model.encode(sentences_1, normalize_embeddings=True)
+ embeddings_2_sentence_transformer = model.encode(sentences_2, normalize_embeddings=True)
+ print('Cosine similarity for original sentence transformer model is ' + str(cos_sim(embeddings_1_sentence_transformer, embeddings_2_sentence_transformer)))
test_teradata.py ADDED
@@ -0,0 +1,106 @@
+ import sys
+ import teradataml as tdml
+ from tabulate import tabulate
+
+ import json
+
+
+ with open('conversion_config.json') as json_file:
+     conversion_config = json.load(json_file)
+
+
+ model_id = conversion_config["model_id"]
+ number_of_generated_embeddings = conversion_config["number_of_generated_embeddings"]
+ precision_to_filename_map = conversion_config["precision_to_filename_map"]
+
+ host = sys.argv[1]
+ username = sys.argv[2]
+ password = sys.argv[3]
+
+ print("Setting up connection to Teradata...")
+ tdml.create_context(host=host, username=username, password=password)
+ print("Done\n\n")
+
+
+ print("Deploying tokenizer...")
+ try:
+     tdml.db_drop_table('tokenizer_table')
+ except:
+     print("Can't drop tokenizer table - it does not exist")
+ tdml.save_byom('tokenizer',
+                'tokenizer.json',
+                'tokenizer_table')
+ print("Done\n\n")
+
+ print("Testing models...")
+ try:
+     tdml.db_drop_table('model_table')
+ except:
+     print("Can't drop model table - it does not exist")
+
+ for precision, file_name in precision_to_filename_map.items():
+     print(f"Deploying {precision} model...")
+     tdml.save_byom(precision,
+                    file_name,
+                    'model_table')
+     print(f"Model {precision} is deployed\n")
+
+     print(f"Calculating embeddings with {precision} model...")
+     try:
+         tdml.db_drop_table('emails_embeddings_store')
+     except:
+         print("Can't drop embeddings table - it does not exist")
+
+     tdml.execute_sql(f"""
+     create volatile table emails_embeddings_store as (
+         select
+             *
+         from mldb.ONNXEmbeddings(
+             on emails.emails as InputTable
+             on (select * from model_table where model_id = '{precision}') as ModelTable DIMENSION
+             on (select model as tokenizer from tokenizer_table where model_id = 'tokenizer') as TokenizerTable DIMENSION
+
+             using
+                 Accumulate('id', 'txt')
+                 ModelOutputTensor('sentence_embedding')
+                 EnableMemoryCheck('false')
+                 OutputFormat('FLOAT32({number_of_generated_embeddings})')
+                 OverwriteCachedModel('true')
+         ) a
+     ) with data on commit preserve rows
+
+     """)
+     print("Embeddings calculated")
+     print(f"Testing semantic search with cosine similarity on the output of the model with precision '{precision}'...")
+     tdf_embeddings_store = tdml.DataFrame('emails_embeddings_store')
+     tdf_embeddings_store_tgt = tdf_embeddings_store[tdf_embeddings_store.id == 3]
+
+     tdf_embeddings_store_ref = tdf_embeddings_store[tdf_embeddings_store.id != 3]
+
+     cos_sim_pd = tdml.DataFrame.from_query(f"""
+     SELECT
+         dt.target_id,
+         dt.reference_id,
+         e_tgt.txt as target_txt,
+         e_ref.txt as reference_txt,
+         (1.0 - dt.distance) as similarity
+     FROM
+         TD_VECTORDISTANCE (
+             ON ({tdf_embeddings_store_tgt.show_query()}) AS TargetTable
+             ON ({tdf_embeddings_store_ref.show_query()}) AS ReferenceTable DIMENSION
+             USING
+                 TargetIDColumn('id')
+                 TargetFeatureColumns('[emb_0:emb_{number_of_generated_embeddings - 1}]')
+                 RefIDColumn('id')
+                 RefFeatureColumns('[emb_0:emb_{number_of_generated_embeddings - 1}]')
+                 DistanceMeasure('cosine')
+                 topk(3)
+         ) AS dt
+     JOIN emails.emails e_tgt on e_tgt.id = dt.target_id
+     JOIN emails.emails e_ref on e_ref.id = dt.reference_id;
+     """).to_pandas()
+     print(tabulate(cos_sim_pd, headers='keys', tablefmt='fancy_grid'))
+     print("Done\n\n")
+
+
+ tdml.remove_context()
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,59 @@
+ {
+   "added_tokens_decoder": {
+     "0": {
+       "content": "[PAD]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "100": {
+       "content": "[UNK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "101": {
+       "content": "[CLS]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "102": {
+       "content": "[SEP]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     },
+     "103": {
+       "content": "[MASK]",
+       "lstrip": false,
+       "normalized": false,
+       "rstrip": false,
+       "single_word": false,
+       "special": true
+     }
+   },
+   "clean_up_tokenization_spaces": false,
+   "cls_token": "[CLS]",
+   "do_basic_tokenize": true,
+   "do_lower_case": false,
+   "extra_special_tokens": {},
+   "mask_token": "[MASK]",
+   "max_len": 512,
+   "model_max_length": 512,
+   "never_split": null,
+   "pad_token": "[PAD]",
+   "sep_token": "[SEP]",
+   "strip_accents": null,
+   "tokenize_chinese_chars": true,
+   "tokenizer_class": "DistilBertTokenizer",
+   "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff