jebish7 committed · verified
Commit bdc96e0 · 1 Parent(s): c6bbc48

Add new SentenceTransformer model

README.md ADDED
@@ -0,0 +1,491 @@
1
+ ---
2
+ tags:
3
+ - sentence-transformers
4
+ - sentence-similarity
5
+ - feature-extraction
6
+ - generated_from_trainer
7
+ - dataset_size:29545
8
+ - loss:MultipleNegativesSymmetricRankingLoss
9
+ base_model: jxm/cde-small-v2
10
+ widget:
11
+ - source_sentence: In the context of the risk-based assessment of customers and business
12
+ relationships, how should the overlap between customer risk assessment and CDD
13
+ be managed to ensure both are completed effectively and in compliance with ADGM
14
+ regulations?
15
+ sentences:
16
+ - 'DocumentID: 36 | PassageID: D.7. | Passage: Principle 7 – Scenario analysis of
17
+ climate-related financial risks. Where appropriate, relevant financial firms should
18
+ develop and implement climate-related scenario analysis frameworks, including
19
+ stress testing, in a manner commensurate with their size, complexity, risk profile
20
+ and nature of activities.
21
+
22
+ '
23
+ - 'DocumentID: 1 | PassageID: 7.Guidance.4. | Passage: The risk-based assessment
24
+ of the customer and the proposed business relationship, Transaction or product
25
+ required under this Chapter is required to be undertaken prior to the establishment
26
+ of a business relationship with a customer. Because the risk rating assigned to
27
+ a customer resulting from this assessment determines the level of CDD that must
28
+ be undertaken for that customer, this process must be completed before the CDD
29
+ is completed for the customer. The Regulator is aware that in practice there will
30
+ often be some degree of overlap between the customer risk assessment and CDD.
31
+ For example, a Relevant Person may undertake some aspects of CDD, such as identifying
32
+ Beneficial Owners, when it performs a risk assessment of the customer. Conversely,
33
+ a Relevant Person may also obtain relevant information as part of CDD which has
34
+ an impact on its customer risk assessment. Where information obtained as part
35
+ of CDD of a customer affects the risk rating of a customer, the change in risk
36
+ rating should be reflected in the degree of CDD undertaken.'
37
+ - 'DocumentID: 1 | PassageID: 9.1.2.Guidance.4. | Passage: Where the legislative
38
+ framework of a jurisdiction (such as secrecy or data protection legislation) prevents
39
+ a Relevant Person from having access to CDD information upon request without delay
40
+ as referred to in Rule ‎9.1.1(3)(b), the Relevant Person should undertake the
41
+ relevant CDD itself and should not seek to rely on the relevant third party.'
42
+ - source_sentence: Can you clarify the responsibilities of the Governing Body of a
43
+ Relevant Person in establishing and maintaining AML/TFS policies and procedures,
44
+ and how these should be documented and reviewed?
45
+ sentences:
46
+ - 'DocumentID: 28 | PassageID: 193) | Passage: SUPERVISION BY LISTING AUTHORITY
47
+
48
+ Complaints or allegations of non-compliance by Reporting Entities
49
+
50
+ If, as a result of the enquiry, the Listing Authority forms the view that the
51
+ information is accurate, is Inside Information, and is not within exemption from
52
+ Disclosure provided by Rule 7.2.2, the Listing Authority will ask the Reporting
53
+ Entity to make a Disclosure about the matter under Rule 7.2.1. If the information
54
+ should have been Disclosed earlier, the Listing Authority may issue an ‘aware
55
+ letter’ (see paragraphs 187 to 189 above), or take other relevant action.
56
+
57
+
58
+ '
59
+ - "DocumentID: 17 | PassageID: Part 13.165.(2) | Passage: The Regulator shall not\
60
+ \ approve a Non Abu Dhabi Global Market Clearing House unless it is satisfied—\n\
61
+ (a)\tthat the rules and practices of the body, together with the law of the country\
62
+ \ in which the body's head office is situated, provide adequate procedures for\
63
+ \ dealing with the default of persons party to contracts connected with the body;\
64
+ \ and\n(b)\tthat it is otherwise appropriate to approve the body;\ntogether being\
65
+ \ the “Relevant Requirements” for this Part."
66
+ - "DocumentID: 1 | PassageID: 4.3.1 | Passage: A Relevant Person which is part of\
67
+ \ a Group must ensure that it:\n(a)\thas developed and implemented policies and\
68
+ \ procedures for the sharing of information between Group entities, including\
69
+ \ the sharing of information relating to CDD and money laundering risks;\n(b)\t\
70
+ has in place adequate safeguards on the confidentiality and use of information\
71
+ \ exchanged between Group entities, including consideration of relevant data protection\
72
+ \ legislation;\n(c)\tremains aware of the money laundering risks of the Group\
73
+ \ as a whole and of its exposure to the Group and takes active steps to mitigate\
74
+ \ such risks;\n(d)\tcontributes to a Group-wide risk assessment to identify and\
75
+ \ assess money laundering risks for the Group; and\n(e)\tprovides its Group-wide\
76
+ \ compliance, audit and AML/TFS functions with customer account and Transaction\
77
+ \ information from its Branches and Subsidiaries when necessary for AML/TFS purposes."
78
+ - source_sentence: What specific accounting standards and practices are we required
79
+ to follow when valuing positions in our Trading and Non-Trading Books to ensure
80
+ compliance with ADGM regulations?
81
+ sentences:
82
+ - 'DocumentID: 7 | PassageID: 8.10.1.(2).Guidance.3. | Passage: Each Authorised
83
+ Person, Recognised Body and its Auditors is also required under Part 16 and section
84
+ 193 of the FSMR respectively, to disclose to the Regulator any matter which may
85
+ indicate a breach or likely breach of, or a failure or likely failure to comply
86
+ with, Regulations or Rules. Each Authorised Person and Recognised Body is also
87
+ required to establish and implement systems and procedures to enable its compliance
88
+ and compliance by its Auditors with notification requirements.
89
+
90
+ '
91
+ - "DocumentID: 18 | PassageID: 3.2 | Passage: Financial Services Permissions. VC\
92
+ \ Managers operating in ADGM require a Financial Services Permission (“FSP”) to\
93
+ \ undertake any Regulated Activity pertaining to VC Funds and/or co-investments\
94
+ \ by third parties in VC Funds. The Regulated Activities covered by the FSP will\
95
+ \ be dependent on the VC Managers’ investment strategy and business model.\n(a)\t\
96
+ Managing a Collective Investment Fund: this includes carrying out fund management\
97
+ \ activities in respect of a VC Fund.\n(b)\tAdvising on Investments or Credit\
98
+ \ : for VC Managers these activities will be restricted to activities related\
99
+ \ to co-investment alongside a VC Fund which the VC Manager manages, such as recommending\
100
+ \ that a client invest in an investee company alongside the VC Fund and on the\
101
+ \ strategy and structure required to make the investment.\n(c)\tArranging Deals\
102
+ \ in Investments: VC Managers may also wish to make arrangements to facilitate\
103
+ \ co-investments in the investee company.\nAuthorisation fees and supervision\
104
+ \ fees for a VC Manager are capped at USD 10,000 regardless of whether one or\
105
+ \ both of the additional Regulated Activities in b) and c) above in relation to\
106
+ \ co-investments are included in its FSP. The FSP will include restrictions appropriate\
107
+ \ to the business model of a VC Manager."
108
+ - 'DocumentID: 13 | PassageID: APP2.A2.1.1.(4) | Passage: An Authorised Person must
109
+ value every position included in its Trading Book and the Non Trading Book in
110
+ accordance with the relevant accounting standards and practices.
111
+
112
+ '
113
+ - source_sentence: What documentation and information are we required to maintain
114
+ to demonstrate compliance with the rules pertaining to the cooperation with auditors,
115
+ especially in terms of providing access and not interfering with their duties?
116
+ sentences:
117
+ - "DocumentID: 6 | PassageID: PART 5.16.3.5 | Passage: Co-operation with auditors.\
118
+ \ A Fund Manager must take reasonable steps to ensure that it and its Employees:\n\
119
+ (a)\tprovide any information to its auditor that its auditor reasonably requires,\
120
+ \ or is entitled to receive as auditor;\n(b)\tgive the auditor right of access\
121
+ \ at all reasonable times to relevant records and information within its possession;\n\
122
+ (c)\tallow the auditor to make copies of any records or information referred to\
123
+ \ in ‎(b);\n(d)\tdo not interfere with the auditor's ability to discharge its\
124
+ \ duties;\n(e)\treport to the auditor any matter which may significantly affect\
125
+ \ the financial position of the Fund; and\n(f)\tprovide such other assistance\
126
+ \ as the auditor may reasonably request it to provide."
127
+ - "DocumentID: 13 | PassageID: 4.3.1 | Passage: An Authorised Person must implement\
128
+ \ and maintain comprehensive Credit Risk management systems which:\n(a)\tare appropriate\
129
+ \ to the firm's type, scope, complexity and scale of operations;\n(b)\tare appropriate\
130
+ \ to the diversity of its operations, including geographical diversity;\n(c)\t\
131
+ enable the firm to effectively identify, assess, monitor and control Credit Risk\
132
+ \ and to ensure that adequate Capital Resources are available at all times to\
133
+ \ cover the risks assumed; and\n(d)\tensure effective implementation of the Credit\
134
+ \ Risk strategy and policy."
135
+ - 'DocumentID: 3 | PassageID: 3.8.9 | Passage: The Authorised Person acting as the
136
+ Investment Manager of an ADGM Green Portfolio must provide a copy of the attestation
137
+ obtained for the purposes of Rule ‎3.8.6 to each Client with whom it has entered
138
+ into a Discretionary Portfolio Management Agreement in respect of such ADGM Green
139
+ Portfolio at least on an annual basis and upon request by the Client.'
140
+ - source_sentence: Could you provide examples of circumstances that, when changed,
141
+ would necessitate the reevaluation of a customer's risk assessment and the application
142
+ of updated CDD measures?
143
+ sentences:
144
+ - 'DocumentID: 13 | PassageID: 9.2.1.Guidance.1. | Passage: The Regulator expects
145
+ that an Authorised Person''s Liquidity Risk strategy will set out the approach
146
+ that the Authorised Person will take to Liquidity Risk management, including various
147
+ quantitative and qualitative targets. It should be communicated to all relevant
148
+ functions and staff within the organisation and be set out in the Authorised Person''s
149
+ Liquidity Risk policy.'
150
+ - "DocumentID: 1 | PassageID: 8.1.2.(1) | Passage: A Relevant Person must also apply\
151
+ \ CDD measures to each existing customer under Rules ‎8.3.1, ‎8.4.1 or ‎8.5.1\
152
+ \ as applicable:\n(a)\twith a frequency appropriate to the outcome of the risk-based\
153
+ \ approach taken in relation to each customer; and\n(b)\twhen the Relevant Person\
154
+ \ becomes aware that any circumstances relevant to its risk assessment for a customer\
155
+ \ have changed."
156
+ - "DocumentID: 1 | PassageID: 8.1.1.Guidance.2. | Passage: The FIU has issued guides\
157
+ \ that require:\n(a)\ta DNFBP that is a dealer in precious metals or precious\
158
+ \ stones to obtain relevant identification documents, such as passport, emirates\
159
+ \ ID, trade licence, as applicable, and register the information via goAML for\
160
+ \ all cash transactions equal to or exceeding USD15,000 with individuals and all\
161
+ \ cash or wire transfer transactions equal to or exceeding USD15,000 with entities.\
162
+ \ The Regulator expects a dealer in any saleable item or a price equal to or greater\
163
+ \ than USD15,000 to also comply with this requirement;\n(b)\ta DNFBP that is a\
164
+ \ real estate agent to obtain relevant identification documents, such as passport,\
165
+ \ emirates ID, trade licence, as applicable, and register the information via\
166
+ \ goAML for all sales or purchases of Real Property where:\n(i)\tthe payment for\
167
+ \ the sale/purchase includes a total cash payment of USD15,000 or more whether\
168
+ \ in a single cash payment or multiple cash payments;\n(ii)\tthe payment for any\
169
+ \ part or all of the sale/purchase amount includes payment(s) using Virtual Assets;\n\
170
+ (iii)\tthe payment for any part or all of the sale/purchase amount includes funds\
171
+ \ that were converted from or to a Virtual Asset."
172
+ pipeline_tag: sentence-similarity
173
+ library_name: sentence-transformers
174
+ ---
175
+
176
+ # SentenceTransformer based on jxm/cde-small-v2
177
+
178
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [jxm/cde-small-v2](https://huggingface.co/jxm/cde-small-v2) on the csv dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space (the hidden size of the underlying ModernBERT-base embedder) and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
179
+
180
+ ## Model Details
181
+
182
+ ### Model Description
183
+ - **Model Type:** Sentence Transformer
184
+ - **Base model:** [jxm/cde-small-v2](https://huggingface.co/jxm/cde-small-v2) <!-- at revision 287bf0ea6ebfecf2339762d0ef28fb846959a8f2 -->
185
+ - **Maximum Sequence Length:** 512 tokens (per `max_seq_length` in the model config)
186
+ - **Output Dimensionality:** 768 dimensions (inherited from the ModernBERT-base embedder; `embedding_output_dim` is unset in the config)
187
+ - **Similarity Function:** Cosine Similarity
188
+ - **Training Dataset:**
189
+ - csv
190
+ <!-- - **Language:** Unknown -->
191
+ <!-- - **License:** Unknown -->
192
+
193
+ ### Model Sources
194
+
195
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
196
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
197
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
198
+
199
+ ### Full Model Architecture
200
+
201
+ ```
202
+ SentenceTransformer(
203
+ (0): Transformer({}) with Transformer model: ContextualDocumentEmbeddingTransformer
204
+ )
205
+ ```
206
+
207
+ ## Usage
208
+
209
+ ### Direct Usage (Sentence Transformers)
210
+
211
+ First install the Sentence Transformers library:
212
+
213
+ ```bash
214
+ pip install -U sentence-transformers
215
+ ```
216
+
217
+ Then you can load this model and run inference.
218
+ ```python
219
+ from sentence_transformers import SentenceTransformer
220
+
221
+ # Download from the 🤗 Hub
222
+ model = SentenceTransformer("jebish7/cde-v2-obliqa-1", trust_remote_code=True)  # required by the bundled custom Transformer module
223
+ # Run inference
224
+ sentences = [
225
+ "Could you provide examples of circumstances that, when changed, would necessitate the reevaluation of a customer's risk assessment and the application of updated CDD measures?",
226
+ 'DocumentID: 1 | PassageID: 8.1.2.(1) | Passage: A Relevant Person must also apply CDD measures to each existing customer under Rules \u200e8.3.1, \u200e8.4.1 or \u200e8.5.1 as applicable:\n(a)\twith a frequency appropriate to the outcome of the risk-based approach taken in relation to each customer; and\n(b)\twhen the Relevant Person becomes aware that any circumstances relevant to its risk assessment for a customer have changed.',
227
+ "DocumentID: 13 | PassageID: 9.2.1.Guidance.1. | Passage: The Regulator expects that an Authorised Person's Liquidity Risk strategy will set out the approach that the Authorised Person will take to Liquidity Risk management, including various quantitative and qualitative targets. It should be communicated to all relevant functions and staff within the organisation and be set out in the Authorised Person's Liquidity Risk policy.",
228
+ ]
229
+ embeddings = model.encode(sentences)
230
+ print(embeddings.shape)
231
+ # [3, 768]
232
+
233
+ # Get the similarity scores for the embeddings
234
+ similarities = model.similarity(embeddings, embeddings)
235
+ print(similarities.shape)
236
+ # [3, 3]
237
+ ```
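+ 
+ This repository's `config_sentence_transformers.json` registers a `query` prompt (`"search_query: "`) and a `document` prompt (`"search_document: "`). The sketch below shows one possible retrieval-style (asymmetric) usage that selects these prompts via `prompt_name`; the query and passage strings are placeholders.
+ 
+ ```python
+ from sentence_transformers import SentenceTransformer
+ 
+ model = SentenceTransformer("jebish7/cde-v2-obliqa-1", trust_remote_code=True)
+ 
+ queries = ["What CDD measures apply to existing customers?"]
+ passages = [
+     "DocumentID: 1 | PassageID: 8.1.2.(1) | Passage: A Relevant Person must also apply CDD measures to each existing customer ...",
+ ]
+ 
+ # Prepend the configured prompts by name instead of concatenating them manually
+ query_embeddings = model.encode(queries, prompt_name="query")
+ passage_embeddings = model.encode(passages, prompt_name="document")
+ 
+ # Cosine similarity, as configured in `similarity_fn_name`
+ scores = model.similarity(query_embeddings, passage_embeddings)
+ print(scores)
+ ```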
238
+
239
+ <!--
240
+ ### Direct Usage (Transformers)
241
+
242
+ <details><summary>Click to see the direct usage in Transformers</summary>
243
+
244
+ </details>
245
+ -->
246
+
247
+ <!--
248
+ ### Downstream Usage (Sentence Transformers)
249
+
250
+ You can finetune this model on your own dataset.
251
+
252
+ <details><summary>Click to expand</summary>
253
+
254
+ </details>
255
+ -->
256
+
257
+ <!--
258
+ ### Out-of-Scope Use
259
+
260
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
261
+ -->
262
+
263
+ <!--
264
+ ## Bias, Risks and Limitations
265
+
266
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
267
+ -->
268
+
269
+ <!--
270
+ ### Recommendations
271
+
272
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
273
+ -->
274
+
275
+ ## Training Details
276
+
277
+ ### Training Dataset
278
+
279
+ #### csv
280
+
281
+ * Dataset: csv
282
+ * Size: 29,545 training samples
283
+ * Columns: <code>anchor</code> and <code>positive</code>
284
+ * Approximate statistics based on the first 1000 samples:
285
+ | | anchor | positive |
286
+ |:--------|:-----------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|
287
+ | type | string | string |
288
+ | details | <ul><li>min: 17 tokens</li><li>mean: 35.21 tokens</li><li>max: 66 tokens</li></ul> | <ul><li>min: 29 tokens</li><li>mean: 143.53 tokens</li><li>max: 512 tokens</li></ul> |
289
+ * Samples:
290
+ | anchor | positive |
291
+ |:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
292
+ | <code>Could you outline the expected procedures for a Trade Repository to notify relevant authorities of any significant errors or omissions in previously submitted data?</code> | <code>DocumentID: 7 | PassageID: APP2.A2.1.2 | Passage: Processes and procedures. A Trade Repository must have effective processes and procedures to provide data to relevant authorities in a timely and appropriate manner to enable them to meet their respective regulatory mandates and legal responsibilities.</code> |
293
+ | <code>In the context of a non-binding MPO, how are commodities held by an Authorised Person treated for the purpose of determining the Commodities Risk Capital Requirement?</code> | <code>DocumentID: 9 | PassageID: 5.4.13.(a) | Passage: Commodities held by an Authorised Person for selling or leasing when executing a Murabaha, non-binding MPO, Salam or parallel Salam contract must be included in the calculation of its Commodities Risk Capital Requirement.</code> |
294
+ | <code>Can the FSRA provide case studies or examples of best practices for RIEs operating MTFs or OTFs using spot commodities in line with the Spot Commodities Framework?</code> | <code>DocumentID: 34 | PassageID: 77) | Passage: REGULATORY REQUIREMENTS - SPOT COMMODITY ACTIVITIES<br>RIEs operating an MTF or OTF using Accepted Spot Commodities<br>This means that an RIE (in addition to operating markets relating to the trading of Financial Instruments) can, where permitted by the FSRA and subject to MIR Rule 3.4.2, operate a separate MTF or OTF under its Recognition Order. This MTF or OTF may operate using Accepted Spot Commodities.<br></code> |
295
+ * Loss: [<code>MultipleNegativesSymmetricRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativessymmetricrankingloss) with these parameters:
296
+ ```json
297
+ {
298
+ "scale": 20.0,
299
+ "similarity_fct": "cos_sim"
300
+ }
301
+ ```
302
+
303
+ ### Training Hyperparameters
304
+ #### Non-Default Hyperparameters
305
+
306
+ - `per_device_train_batch_size`: 12
307
+ - `num_train_epochs`: 1
308
+ - `warmup_ratio`: 0.1
309
+ - `batch_sampler`: no_duplicates
310
+
311
+ #### All Hyperparameters
312
+ <details><summary>Click to expand</summary>
313
+
314
+ - `overwrite_output_dir`: False
315
+ - `do_predict`: False
316
+ - `eval_strategy`: no
317
+ - `prediction_loss_only`: True
318
+ - `per_device_train_batch_size`: 12
319
+ - `per_device_eval_batch_size`: 8
320
+ - `per_gpu_train_batch_size`: None
321
+ - `per_gpu_eval_batch_size`: None
322
+ - `gradient_accumulation_steps`: 1
323
+ - `eval_accumulation_steps`: None
324
+ - `torch_empty_cache_steps`: None
325
+ - `learning_rate`: 5e-05
326
+ - `weight_decay`: 0.0
327
+ - `adam_beta1`: 0.9
328
+ - `adam_beta2`: 0.999
329
+ - `adam_epsilon`: 1e-08
330
+ - `max_grad_norm`: 1.0
331
+ - `num_train_epochs`: 1
332
+ - `max_steps`: -1
333
+ - `lr_scheduler_type`: linear
334
+ - `lr_scheduler_kwargs`: {}
335
+ - `warmup_ratio`: 0.1
336
+ - `warmup_steps`: 0
337
+ - `log_level`: passive
338
+ - `log_level_replica`: warning
339
+ - `log_on_each_node`: True
340
+ - `logging_nan_inf_filter`: True
341
+ - `save_safetensors`: True
342
+ - `save_on_each_node`: False
343
+ - `save_only_model`: False
344
+ - `restore_callback_states_from_checkpoint`: False
345
+ - `no_cuda`: False
346
+ - `use_cpu`: False
347
+ - `use_mps_device`: False
348
+ - `seed`: 42
349
+ - `data_seed`: None
350
+ - `jit_mode_eval`: False
351
+ - `use_ipex`: False
352
+ - `bf16`: False
353
+ - `fp16`: False
354
+ - `fp16_opt_level`: O1
355
+ - `half_precision_backend`: auto
356
+ - `bf16_full_eval`: False
357
+ - `fp16_full_eval`: False
358
+ - `tf32`: None
359
+ - `local_rank`: 0
360
+ - `ddp_backend`: None
361
+ - `tpu_num_cores`: None
362
+ - `tpu_metrics_debug`: False
363
+ - `debug`: []
364
+ - `dataloader_drop_last`: False
365
+ - `dataloader_num_workers`: 0
366
+ - `dataloader_prefetch_factor`: None
367
+ - `past_index`: -1
368
+ - `disable_tqdm`: False
369
+ - `remove_unused_columns`: True
370
+ - `label_names`: None
371
+ - `load_best_model_at_end`: False
372
+ - `ignore_data_skip`: False
373
+ - `fsdp`: []
374
+ - `fsdp_min_num_params`: 0
375
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
376
+ - `fsdp_transformer_layer_cls_to_wrap`: None
377
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
378
+ - `deepspeed`: None
379
+ - `label_smoothing_factor`: 0.0
380
+ - `optim`: adamw_torch
381
+ - `optim_args`: None
382
+ - `adafactor`: False
383
+ - `group_by_length`: False
384
+ - `length_column_name`: length
385
+ - `ddp_find_unused_parameters`: None
386
+ - `ddp_bucket_cap_mb`: None
387
+ - `ddp_broadcast_buffers`: False
388
+ - `dataloader_pin_memory`: True
389
+ - `dataloader_persistent_workers`: False
390
+ - `skip_memory_metrics`: True
391
+ - `use_legacy_prediction_loop`: False
392
+ - `push_to_hub`: False
393
+ - `resume_from_checkpoint`: None
394
+ - `hub_model_id`: None
395
+ - `hub_strategy`: every_save
396
+ - `hub_private_repo`: None
397
+ - `hub_always_push`: False
398
+ - `gradient_checkpointing`: False
399
+ - `gradient_checkpointing_kwargs`: None
400
+ - `include_inputs_for_metrics`: False
401
+ - `include_for_metrics`: []
402
+ - `eval_do_concat_batches`: True
403
+ - `fp16_backend`: auto
404
+ - `push_to_hub_model_id`: None
405
+ - `push_to_hub_organization`: None
406
+ - `mp_parameters`:
407
+ - `auto_find_batch_size`: False
408
+ - `full_determinism`: False
409
+ - `torchdynamo`: None
410
+ - `ray_scope`: last
411
+ - `ddp_timeout`: 1800
412
+ - `torch_compile`: False
413
+ - `torch_compile_backend`: None
414
+ - `torch_compile_mode`: None
415
+ - `dispatch_batches`: None
416
+ - `split_batches`: None
417
+ - `include_tokens_per_second`: False
418
+ - `include_num_input_tokens_seen`: False
419
+ - `neftune_noise_alpha`: None
420
+ - `optim_target_modules`: None
421
+ - `batch_eval_metrics`: False
422
+ - `eval_on_start`: False
423
+ - `use_liger_kernel`: False
424
+ - `eval_use_gather_object`: False
425
+ - `average_tokens_across_devices`: False
426
+ - `prompts`: None
427
+ - `batch_sampler`: no_duplicates
428
+ - `multi_dataset_batch_sampler`: proportional
429
+
430
+ </details>
431
+
432
+ ### Training Logs
433
+ | Epoch | Step | Training Loss |
434
+ |:------:|:----:|:-------------:|
435
+ | 0.0812 | 100 | 1.7126 |
436
+ | 0.1623 | 200 | 0.7412 |
437
+ | 0.2435 | 300 | 0.6673 |
438
+ | 0.3247 | 400 | 0.6119 |
439
+ | 0.4058 | 500 | 0.5413 |
440
+ | 0.4870 | 600 | 0.5807 |
441
+ | 0.5682 | 700 | 0.506 |
442
+ | 0.6494 | 800 | 0.5132 |
443
+ | 0.7305 | 900 | 0.4641 |
444
+ | 0.8117 | 1000 | 0.456 |
445
+ | 0.8929 | 1100 | 0.4954 |
446
+ | 0.9740 | 1200 | 0.4088 |
447
+
448
+
449
+ ### Framework Versions
450
+ - Python: 3.10.12
451
+ - Sentence Transformers: 3.3.1
452
+ - Transformers: 4.48.3
453
+ - PyTorch: 2.5.1+cu121
454
+ - Accelerate: 1.2.1
455
+ - Datasets: 3.3.2
456
+ - Tokenizers: 0.21.0
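+ 
+ To approximate this environment, the versions above can be pinned when installing (optional; the PyTorch build used here was the cu121 wheel):
+ 
+ ```bash
+ pip install "sentence-transformers==3.3.1" "transformers==4.48.3" \
+     "accelerate==1.2.1" "datasets==3.3.2" "tokenizers==0.21.0" "torch==2.5.1"
+ ```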
457
+
458
+ ## Citation
459
+
460
+ ### BibTeX
461
+
462
+ #### Sentence Transformers
463
+ ```bibtex
464
+ @inproceedings{reimers-2019-sentence-bert,
465
+ title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
466
+ author = "Reimers, Nils and Gurevych, Iryna",
467
+ booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
468
+ month = "11",
469
+ year = "2019",
470
+ publisher = "Association for Computational Linguistics",
471
+ url = "https://arxiv.org/abs/1908.10084",
472
+ }
473
+ ```
474
+
475
+ <!--
476
+ ## Glossary
477
+
478
+ *Clearly define terms in order to be accessible across audiences.*
479
+ -->
480
+
481
+ <!--
482
+ ## Model Card Authors
483
+
484
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
485
+ -->
486
+
487
+ <!--
488
+ ## Model Card Contact
489
+
490
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
491
+ -->
config.json ADDED
@@ -0,0 +1,36 @@
1
+ {
2
+ "_name_or_path": "jxm/cde-small-v2",
3
+ "architecture": "transductive",
4
+ "architectures": [
5
+ "ContextualDocumentEmbeddingTransformer"
6
+ ],
7
+ "attn_implementation": null,
8
+ "auto_map": {
9
+ "AutoConfig": "jxm/cde-small-v2--model.ContextualModelConfig",
10
+ "AutoModel": "jxm/cde-small-v2--model.ContextualDocumentEmbeddingTransformer"
11
+ },
12
+ "autoregressive_backbone": false,
13
+ "cache_dir": null,
14
+ "config_name": null,
15
+ "dataset_backbone": null,
16
+ "disable_dropout": true,
17
+ "disable_transductive_rotary_embedding": true,
18
+ "embedder": "answerdotai/ModernBERT-base",
19
+ "embedder_rerank": "sentence-transformers/gtr-t5-base",
20
+ "embedding_output_dim": null,
21
+ "limit_layers": null,
22
+ "limit_layers_first_stage": null,
23
+ "logit_scale": 50.0,
24
+ "max_seq_length": 512,
25
+ "model_revision": "main",
26
+ "pool_ignore_contextual_tokens": true,
27
+ "pool_ignore_instruction_tokens": true,
28
+ "pooling_strategy": "mean",
29
+ "tokenizer_name": null,
30
+ "torch_dtype": "float32",
31
+ "transductive_corpus_size": 512,
32
+ "transductive_sequence_dropout_prob": 0.0,
33
+ "transductive_tie_token_embeddings": false,
34
+ "transductive_tokens_per_document": 1,
35
+ "transformers_version": "4.48.3"
36
+ }
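The `auto_map` entries above point to the custom `ContextualDocumentEmbeddingTransformer` class published with `jxm/cde-small-v2`, so the raw backbone can also be loaded with plain `transformers`. A minimal sketch (most users should prefer the `SentenceTransformer` wrapper from the README, which also handles tokenization, prompts, and pooling):

```python
from transformers import AutoModel

# trust_remote_code is required because the model class is resolved from the jxm/cde-small-v2 repo
model = AutoModel.from_pretrained("jebish7/cde-v2-obliqa-1", trust_remote_code=True)
print(type(model).__name__)  # ContextualDocumentEmbeddingTransformer
```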
config_sentence_transformers.json ADDED
@@ -0,0 +1,13 @@
1
+ {
2
+ "__version__": {
3
+ "sentence_transformers": "3.3.1",
4
+ "transformers": "4.48.3",
5
+ "pytorch": "2.5.1+cu121"
6
+ },
7
+ "prompts": {
8
+ "query": "search_query: ",
9
+ "document": "search_document: "
10
+ },
11
+ "default_prompt_name": null,
12
+ "similarity_fn_name": "cosine"
13
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:85f9f1e13c491bd15bef0b2a15f71af13d62617ba86a3ccea7acef3cb50c1489
3
+ size 1222859872
modules.json ADDED
@@ -0,0 +1,8 @@
1
+ [
2
+ {
3
+ "idx": 0,
4
+ "name": "0",
5
+ "path": "",
6
+ "type": "sentence_transformers_impl.Transformer"
7
+ }
8
+ ]
sentence_bert_config.json ADDED
@@ -0,0 +1 @@
1
+ {}
sentence_transformers_impl.py ADDED
@@ -0,0 +1,155 @@
1
+ from __future__ import annotations
2
+
3
+ import json
4
+ import logging
5
+ import os
6
+ from typing import Any, Optional
7
+
8
+ import torch
9
+ from torch import nn
10
+ from transformers import AutoConfig, AutoModel, AutoTokenizer
11
+
12
+ logger = logging.getLogger(__name__)
13
+
14
+
15
+ class Transformer(nn.Module):
16
+ """Hugging Face AutoModel to generate token embeddings.
17
+ Loads the correct class, e.g. BERT / RoBERTa etc.
18
+ Args:
19
+ model_name_or_path: Hugging Face models name
20
+ (https://huggingface.co/models)
21
+ max_seq_length: Truncate any inputs longer than max_seq_length
22
+ model_args: Keyword arguments passed to the Hugging Face
23
+ Transformers model
24
+ tokenizer_args: Keyword arguments passed to the Hugging Face
25
+ Transformers tokenizer
26
+ config_args: Keyword arguments passed to the Hugging Face
27
+ Transformers config
28
+ cache_dir: Cache dir for Hugging Face Transformers to store/load
29
+ models
30
+ do_lower_case: If true, lowercases the input (independent if the
31
+ model is cased or not)
32
+ tokenizer_name_or_path: Name or path of the tokenizer. When
33
+ None, then model_name_or_path is used
34
+ backend: Backend used for model inference. Can be `torch`, `onnx`,
35
+ or `openvino`. Default is `torch`.
36
+ """
37
+
38
+ save_in_root: bool = True
39
+
40
+ def __init__(
41
+ self,
42
+ model_name_or_path: str,
43
+ model_args: dict[str, Any] | None = None,
44
+ tokenizer_args: dict[str, Any] | None = None,
45
+ config_args: dict[str, Any] | None = None,
46
+ cache_dir: str | None = None,
47
+ **kwargs,
48
+ ) -> None:
49
+ super().__init__()
50
+ if model_args is None:
51
+ model_args = {}
52
+ if tokenizer_args is None:
53
+ tokenizer_args = {}
54
+ if config_args is None:
55
+ config_args = {}
56
+
57
+ if not model_args.get("trust_remote_code", False):
58
+ raise ValueError(
59
+ "You need to set `trust_remote_code=True` to load this model."
60
+ )
61
+
62
+ self.config = AutoConfig.from_pretrained(model_name_or_path, **config_args, cache_dir=cache_dir)
63
+ self.auto_model = AutoModel.from_pretrained(model_name_or_path, config=self.config, cache_dir=cache_dir, **model_args)
64
+
65
+ self.tokenizer = AutoTokenizer.from_pretrained(
66
+ "answerdotai/ModernBERT-base",
67
+ cache_dir=cache_dir,
68
+ **tokenizer_args,
69
+ )
70
+
71
+ def __repr__(self) -> str:
72
+ return f"Transformer({self.get_config_dict()}) with Transformer model: {self.auto_model.__class__.__name__} "
73
+
74
+ def forward(self, features: dict[str, torch.Tensor], dataset_embeddings: Optional[torch.Tensor] = None, **kwargs) -> dict[str, torch.Tensor]:
75
+ """Returns token_embeddings, cls_token"""
76
+ # If we don't have embeddings, then run the 1st stage model.
77
+ # If we do, then run the 2nd stage model.
78
+ if dataset_embeddings is None:
79
+ sentence_embedding = self.auto_model.first_stage_model(
80
+ input_ids=features["input_ids"],
81
+ attention_mask=features["attention_mask"],
82
+ )
83
+ else:
84
+ sentence_embedding = self.auto_model.second_stage_model(
85
+ input_ids=features["input_ids"],
86
+ attention_mask=features["attention_mask"],
87
+ dataset_embeddings=dataset_embeddings,
88
+ )
89
+
90
+ features["sentence_embedding"] = sentence_embedding
91
+ return features
92
+
93
+ def get_word_embedding_dimension(self) -> int:
94
+ return self.auto_model.config.hidden_size
95
+
96
+ def tokenize(
97
+ self, texts: list[str] | list[dict] | list[tuple[str, str]], padding: str | bool = True
98
+ ) -> dict[str, torch.Tensor]:
99
+ """Tokenizes a text and maps tokens to token-ids"""
100
+ output = {}
101
+ if isinstance(texts[0], str):
102
+ to_tokenize = [texts]
103
+ elif isinstance(texts[0], dict):
104
+ to_tokenize = []
105
+ output["text_keys"] = []
106
+ for lookup in texts:
107
+ text_key, text = next(iter(lookup.items()))
108
+ to_tokenize.append(text)
109
+ output["text_keys"].append(text_key)
110
+ to_tokenize = [to_tokenize]
111
+ else:
112
+ batch1, batch2 = [], []
113
+ for text_tuple in texts:
114
+ batch1.append(text_tuple[0])
115
+ batch2.append(text_tuple[1])
116
+ to_tokenize = [batch1, batch2]
117
+
118
+ max_seq_length = self.config.max_seq_length
119
+ output.update(
120
+ self.tokenizer(
121
+ *to_tokenize,
122
+ padding=padding,
123
+ truncation="longest_first",
124
+ return_tensors="pt",
125
+ max_length=max_seq_length,
126
+ )
127
+ )
128
+ return output
129
+
130
+ def get_config_dict(self) -> dict[str, Any]:
131
+ return {}
132
+
133
+ def save(self, output_path: str, safe_serialization: bool = True) -> None:
134
+ self.auto_model.save_pretrained(output_path, safe_serialization=safe_serialization)
135
+ self.tokenizer.save_pretrained(output_path)
136
+
137
+ with open(os.path.join(output_path, "sentence_bert_config.json"), "w") as fOut:
138
+ json.dump(self.get_config_dict(), fOut, indent=2)
139
+
140
+ @classmethod
141
+ def load(cls, input_path: str) -> Transformer:
142
+ sbert_config_path = os.path.join(input_path, "sentence_bert_config.json")
143
+ if not os.path.exists(sbert_config_path):
144
+ return cls(model_name_or_path=input_path)
145
+
146
+ with open(sbert_config_path) as fIn:
147
+ config = json.load(fIn)
148
+ # Don't allow configs to set trust_remote_code
149
+ if "model_args" in config and "trust_remote_code" in config["model_args"]:
150
+ config["model_args"].pop("trust_remote_code")
151
+ if "tokenizer_args" in config and "trust_remote_code" in config["tokenizer_args"]:
152
+ config["tokenizer_args"].pop("trust_remote_code")
153
+ if "config_args" in config and "trust_remote_code" in config["config_args"]:
154
+ config["config_args"].pop("trust_remote_code")
155
+ return cls(model_name_or_path=input_path, **config)
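The `forward` method above runs the first-stage model when no `dataset_embeddings` are passed and the second-stage (contextual) model when they are. The snippet below is a low-level sketch that simply mirrors that logic by calling the module directly; the mini-corpus texts are placeholders, and a realistic mini-corpus would normally be sized to the config's `transductive_corpus_size` of 512.

```python
import torch
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer("jebish7/cde-v2-obliqa-1", trust_remote_code=True)
module = st_model[0]  # the custom Transformer module defined above

# Stage 1: embed a (placeholder) mini-corpus with the first-stage model
corpus_features = module.tokenize([
    "search_document: placeholder passage one",
    "search_document: placeholder passage two",
])
with torch.no_grad():
    dataset_embeddings = module.auto_model.first_stage_model(
        input_ids=corpus_features["input_ids"],
        attention_mask=corpus_features["attention_mask"],
    )

# Stage 2: embed a query conditioned on those corpus embeddings
query_features = module.tokenize(["search_query: What CDD measures apply to existing customers?"])
with torch.no_grad():
    out = module(query_features, dataset_embeddings=dataset_embeddings)
print(out["sentence_embedding"].shape)
```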
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
1
+ {
2
+ "cls_token": {
3
+ "content": "[CLS]",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "mask_token": {
10
+ "content": "[MASK]",
11
+ "lstrip": true,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "[PAD]",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "sep_token": {
24
+ "content": "[SEP]",
25
+ "lstrip": false,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ },
30
+ "unk_token": {
31
+ "content": "[UNK]",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false
36
+ }
37
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,945 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "|||IP_ADDRESS|||",
5
+ "lstrip": false,
6
+ "normalized": true,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": false
10
+ },
11
+ "1": {
12
+ "content": "<|padding|>",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "50254": {
20
+ "content": " ",
21
+ "lstrip": false,
22
+ "normalized": true,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": false
26
+ },
27
+ "50255": {
28
+ "content": " ",
29
+ "lstrip": false,
30
+ "normalized": true,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": false
34
+ },
35
+ "50256": {
36
+ "content": " ",
37
+ "lstrip": false,
38
+ "normalized": true,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": false
42
+ },
43
+ "50257": {
44
+ "content": " ",
45
+ "lstrip": false,
46
+ "normalized": true,
47
+ "rstrip": false,
48
+ "single_word": false,
49
+ "special": false
50
+ },
51
+ "50258": {
52
+ "content": " ",
53
+ "lstrip": false,
54
+ "normalized": true,
55
+ "rstrip": false,
56
+ "single_word": false,
57
+ "special": false
58
+ },
59
+ "50259": {
60
+ "content": " ",
61
+ "lstrip": false,
62
+ "normalized": true,
63
+ "rstrip": false,
64
+ "single_word": false,
65
+ "special": false
66
+ },
67
+ "50260": {
68
+ "content": " ",
69
+ "lstrip": false,
70
+ "normalized": true,
71
+ "rstrip": false,
72
+ "single_word": false,
73
+ "special": false
74
+ },
75
+ "50261": {
76
+ "content": " ",
77
+ "lstrip": false,
78
+ "normalized": true,
79
+ "rstrip": false,
80
+ "single_word": false,
81
+ "special": false
82
+ },
83
+ "50262": {
84
+ "content": " ",
85
+ "lstrip": false,
86
+ "normalized": true,
87
+ "rstrip": false,
88
+ "single_word": false,
89
+ "special": false
90
+ },
91
+ "50263": {
92
+ "content": " ",
93
+ "lstrip": false,
94
+ "normalized": true,
95
+ "rstrip": false,
96
+ "single_word": false,
97
+ "special": false
98
+ },
99
+ "50264": {
100
+ "content": " ",
101
+ "lstrip": false,
102
+ "normalized": true,
103
+ "rstrip": false,
104
+ "single_word": false,
105
+ "special": false
106
+ },
107
+ "50265": {
108
+ "content": " ",
109
+ "lstrip": false,
110
+ "normalized": true,
111
+ "rstrip": false,
112
+ "single_word": false,
113
+ "special": false
114
+ },
115
+ "50266": {
116
+ "content": " ",
117
+ "lstrip": false,
118
+ "normalized": true,
119
+ "rstrip": false,
120
+ "single_word": false,
121
+ "special": false
122
+ },
123
+ "50267": {
124
+ "content": " ",
125
+ "lstrip": false,
126
+ "normalized": true,
127
+ "rstrip": false,
128
+ "single_word": false,
129
+ "special": false
130
+ },
131
+ "50268": {
132
+ "content": " ",
133
+ "lstrip": false,
134
+ "normalized": true,
135
+ "rstrip": false,
136
+ "single_word": false,
137
+ "special": false
138
+ },
139
+ "50269": {
140
+ "content": " ",
141
+ "lstrip": false,
142
+ "normalized": true,
143
+ "rstrip": false,
144
+ "single_word": false,
145
+ "special": false
146
+ },
147
+ "50270": {
148
+ "content": " ",
149
+ "lstrip": false,
150
+ "normalized": true,
151
+ "rstrip": false,
152
+ "single_word": false,
153
+ "special": false
154
+ },
155
+ "50271": {
156
+ "content": " ",
157
+ "lstrip": false,
158
+ "normalized": true,
159
+ "rstrip": false,
160
+ "single_word": false,
161
+ "special": false
162
+ },
163
+ "50272": {
164
+ "content": " ",
165
+ "lstrip": false,
166
+ "normalized": true,
167
+ "rstrip": false,
168
+ "single_word": false,
169
+ "special": false
170
+ },
171
+ "50273": {
172
+ "content": " ",
173
+ "lstrip": false,
174
+ "normalized": true,
175
+ "rstrip": false,
176
+ "single_word": false,
177
+ "special": false
178
+ },
179
+ "50274": {
180
+ "content": " ",
181
+ "lstrip": false,
182
+ "normalized": true,
183
+ "rstrip": false,
184
+ "single_word": false,
185
+ "special": false
186
+ },
187
+ "50275": {
188
+ "content": " ",
189
+ "lstrip": false,
190
+ "normalized": true,
191
+ "rstrip": false,
192
+ "single_word": false,
193
+ "special": false
194
+ },
195
+ "50276": {
196
+ "content": " ",
197
+ "lstrip": false,
198
+ "normalized": true,
199
+ "rstrip": false,
200
+ "single_word": false,
201
+ "special": false
202
+ },
203
+ "50277": {
204
+ "content": "|||EMAIL_ADDRESS|||",
205
+ "lstrip": false,
206
+ "normalized": true,
207
+ "rstrip": false,
208
+ "single_word": false,
209
+ "special": false
210
+ },
211
+ "50278": {
212
+ "content": "|||PHONE_NUMBER|||",
213
+ "lstrip": false,
214
+ "normalized": true,
215
+ "rstrip": false,
216
+ "single_word": false,
217
+ "special": false
218
+ },
219
+ "50279": {
220
+ "content": "<|endoftext|>",
221
+ "lstrip": false,
222
+ "normalized": false,
223
+ "rstrip": false,
224
+ "single_word": false,
225
+ "special": true
226
+ },
227
+ "50280": {
228
+ "content": "[UNK]",
229
+ "lstrip": false,
230
+ "normalized": false,
231
+ "rstrip": false,
232
+ "single_word": false,
233
+ "special": true
234
+ },
235
+ "50281": {
236
+ "content": "[CLS]",
237
+ "lstrip": false,
238
+ "normalized": false,
239
+ "rstrip": false,
240
+ "single_word": false,
241
+ "special": true
242
+ },
243
+ "50282": {
244
+ "content": "[SEP]",
245
+ "lstrip": false,
246
+ "normalized": false,
247
+ "rstrip": false,
248
+ "single_word": false,
249
+ "special": true
250
+ },
251
+ "50283": {
252
+ "content": "[PAD]",
253
+ "lstrip": false,
254
+ "normalized": false,
255
+ "rstrip": false,
256
+ "single_word": false,
257
+ "special": true
258
+ },
259
+ "50284": {
260
+ "content": "[MASK]",
261
+ "lstrip": true,
262
+ "normalized": false,
263
+ "rstrip": false,
264
+ "single_word": false,
265
+ "special": true
266
+ },
267
+ "50285": {
268
+ "content": "[unused0]",
269
+ "lstrip": false,
270
+ "normalized": true,
271
+ "rstrip": false,
272
+ "single_word": false,
273
+ "special": false
274
+ },
275
+ "50286": {
276
+ "content": "[unused1]",
277
+ "lstrip": false,
278
+ "normalized": true,
279
+ "rstrip": false,
280
+ "single_word": false,
281
+ "special": false
282
+ },
283
+ "50287": {
284
+ "content": "[unused2]",
285
+ "lstrip": false,
286
+ "normalized": true,
287
+ "rstrip": false,
288
+ "single_word": false,
289
+ "special": false
290
+ },
291
+ "50288": {
292
+ "content": "[unused3]",
293
+ "lstrip": false,
294
+ "normalized": true,
295
+ "rstrip": false,
296
+ "single_word": false,
297
+ "special": false
298
+ },
299
+ "50289": {
300
+ "content": "[unused4]",
301
+ "lstrip": false,
302
+ "normalized": true,
303
+ "rstrip": false,
304
+ "single_word": false,
305
+ "special": false
306
+ },
307
+ "50290": {
308
+ "content": "[unused5]",
309
+ "lstrip": false,
310
+ "normalized": true,
311
+ "rstrip": false,
312
+ "single_word": false,
313
+ "special": false
314
+ },
315
+ "50291": {
316
+ "content": "[unused6]",
317
+ "lstrip": false,
318
+ "normalized": true,
319
+ "rstrip": false,
320
+ "single_word": false,
321
+ "special": false
322
+ },
323
+ "50292": {
324
+ "content": "[unused7]",
325
+ "lstrip": false,
326
+ "normalized": true,
327
+ "rstrip": false,
328
+ "single_word": false,
329
+ "special": false
330
+ },
331
+ "50293": {
332
+ "content": "[unused8]",
333
+ "lstrip": false,
334
+ "normalized": true,
335
+ "rstrip": false,
336
+ "single_word": false,
337
+ "special": false
338
+ },
339
+ "50294": {
340
+ "content": "[unused9]",
341
+ "lstrip": false,
342
+ "normalized": true,
343
+ "rstrip": false,
344
+ "single_word": false,
345
+ "special": false
346
+ },
347
+ "50295": {
348
+ "content": "[unused10]",
349
+ "lstrip": false,
350
+ "normalized": true,
351
+ "rstrip": false,
352
+ "single_word": false,
353
+ "special": false
354
+ },
355
+ "50296": {
356
+ "content": "[unused11]",
357
+ "lstrip": false,
358
+ "normalized": true,
359
+ "rstrip": false,
360
+ "single_word": false,
361
+ "special": false
362
+ },
363
+ "50297": {
364
+ "content": "[unused12]",
365
+ "lstrip": false,
366
+ "normalized": true,
367
+ "rstrip": false,
368
+ "single_word": false,
369
+ "special": false
370
+ },
371
+ "50298": {
372
+ "content": "[unused13]",
373
+ "lstrip": false,
374
+ "normalized": true,
375
+ "rstrip": false,
376
+ "single_word": false,
377
+ "special": false
378
+ },
379
+ "50299": {
380
+ "content": "[unused14]",
381
+ "lstrip": false,
382
+ "normalized": true,
383
+ "rstrip": false,
384
+ "single_word": false,
385
+ "special": false
386
+ },
387
+ "50300": {
388
+ "content": "[unused15]",
389
+ "lstrip": false,
390
+ "normalized": true,
391
+ "rstrip": false,
392
+ "single_word": false,
393
+ "special": false
394
+ },
395
+ "50301": {
396
+ "content": "[unused16]",
397
+ "lstrip": false,
398
+ "normalized": true,
399
+ "rstrip": false,
400
+ "single_word": false,
401
+ "special": false
402
+ },
403
+ "50302": {
404
+ "content": "[unused17]",
405
+ "lstrip": false,
406
+ "normalized": true,
407
+ "rstrip": false,
408
+ "single_word": false,
409
+ "special": false
410
+ },
411
+ "50303": {
412
+ "content": "[unused18]",
413
+ "lstrip": false,
414
+ "normalized": true,
415
+ "rstrip": false,
416
+ "single_word": false,
417
+ "special": false
418
+ },
419
+ "50304": {
420
+ "content": "[unused19]",
421
+ "lstrip": false,
422
+ "normalized": true,
423
+ "rstrip": false,
424
+ "single_word": false,
425
+ "special": false
426
+ },
427
+ "50305": {
428
+ "content": "[unused20]",
429
+ "lstrip": false,
430
+ "normalized": true,
431
+ "rstrip": false,
432
+ "single_word": false,
433
+ "special": false
434
+ },
435
+ "50306": {
436
+ "content": "[unused21]",
437
+ "lstrip": false,
438
+ "normalized": true,
439
+ "rstrip": false,
440
+ "single_word": false,
441
+ "special": false
442
+ },
443
+ "50307": {
444
+ "content": "[unused22]",
445
+ "lstrip": false,
446
+ "normalized": true,
447
+ "rstrip": false,
448
+ "single_word": false,
449
+ "special": false
450
+ },
451
+ "50308": {
452
+ "content": "[unused23]",
453
+ "lstrip": false,
454
+ "normalized": true,
455
+ "rstrip": false,
456
+ "single_word": false,
457
+ "special": false
458
+ },
459
+ "50309": {
460
+ "content": "[unused24]",
461
+ "lstrip": false,
462
+ "normalized": true,
463
+ "rstrip": false,
464
+ "single_word": false,
465
+ "special": false
466
+ },
467
+ "50310": {
468
+ "content": "[unused25]",
469
+ "lstrip": false,
470
+ "normalized": true,
471
+ "rstrip": false,
472
+ "single_word": false,
473
+ "special": false
474
+ },
475
+ "50311": {
476
+ "content": "[unused26]",
477
+ "lstrip": false,
478
+ "normalized": true,
479
+ "rstrip": false,
480
+ "single_word": false,
481
+ "special": false
482
+ },
483
+ "50312": {
484
+ "content": "[unused27]",
485
+ "lstrip": false,
486
+ "normalized": true,
487
+ "rstrip": false,
488
+ "single_word": false,
489
+ "special": false
490
+ },
491
+ "50313": {
492
+ "content": "[unused28]",
493
+ "lstrip": false,
494
+ "normalized": true,
495
+ "rstrip": false,
496
+ "single_word": false,
497
+ "special": false
498
+ },
499
+ "50314": {
500
+ "content": "[unused29]",
501
+ "lstrip": false,
502
+ "normalized": true,
503
+ "rstrip": false,
504
+ "single_word": false,
505
+ "special": false
506
+ },
507
+ "50315": {
508
+ "content": "[unused30]",
509
+ "lstrip": false,
510
+ "normalized": true,
511
+ "rstrip": false,
512
+ "single_word": false,
513
+ "special": false
514
+ },
515
+ "50316": {
516
+ "content": "[unused31]",
517
+ "lstrip": false,
518
+ "normalized": true,
519
+ "rstrip": false,
520
+ "single_word": false,
521
+ "special": false
522
+ },
523
+ "50317": {
524
+ "content": "[unused32]",
525
+ "lstrip": false,
526
+ "normalized": true,
527
+ "rstrip": false,
528
+ "single_word": false,
529
+ "special": false
530
+ },
531
+ "50318": {
532
+ "content": "[unused33]",
533
+ "lstrip": false,
534
+ "normalized": true,
535
+ "rstrip": false,
536
+ "single_word": false,
537
+ "special": false
538
+ },
539
+ "50319": {
540
+ "content": "[unused34]",
541
+ "lstrip": false,
542
+ "normalized": true,
543
+ "rstrip": false,
544
+ "single_word": false,
545
+ "special": false
546
+ },
547
+ "50320": {
548
+ "content": "[unused35]",
549
+ "lstrip": false,
550
+ "normalized": true,
551
+ "rstrip": false,
552
+ "single_word": false,
553
+ "special": false
554
+ },
555
+ "50321": {
556
+ "content": "[unused36]",
557
+ "lstrip": false,
558
+ "normalized": true,
559
+ "rstrip": false,
560
+ "single_word": false,
561
+ "special": false
562
+ },
563
+ "50322": {
564
+ "content": "[unused37]",
565
+ "lstrip": false,
566
+ "normalized": true,
567
+ "rstrip": false,
568
+ "single_word": false,
569
+ "special": false
570
+ },
571
+ "50323": {
572
+ "content": "[unused38]",
573
+ "lstrip": false,
574
+ "normalized": true,
575
+ "rstrip": false,
576
+ "single_word": false,
577
+ "special": false
578
+ },
579
+ "50324": {
580
+ "content": "[unused39]",
581
+ "lstrip": false,
582
+ "normalized": true,
583
+ "rstrip": false,
584
+ "single_word": false,
585
+ "special": false
586
+ },
587
+ "50325": {
588
+ "content": "[unused40]",
589
+ "lstrip": false,
590
+ "normalized": true,
591
+ "rstrip": false,
592
+ "single_word": false,
593
+ "special": false
594
+ },
595
+ "50326": {
596
+ "content": "[unused41]",
597
+ "lstrip": false,
598
+ "normalized": true,
599
+ "rstrip": false,
600
+ "single_word": false,
601
+ "special": false
602
+ },
603
+ "50327": {
604
+ "content": "[unused42]",
605
+ "lstrip": false,
606
+ "normalized": true,
607
+ "rstrip": false,
608
+ "single_word": false,
609
+ "special": false
610
+ },
611
+ "50328": {
612
+ "content": "[unused43]",
613
+ "lstrip": false,
614
+ "normalized": true,
615
+ "rstrip": false,
616
+ "single_word": false,
617
+ "special": false
618
+ },
619
+ "50329": {
620
+ "content": "[unused44]",
621
+ "lstrip": false,
622
+ "normalized": true,
623
+ "rstrip": false,
624
+ "single_word": false,
625
+ "special": false
626
+ },
627
+ "50330": {
628
+ "content": "[unused45]",
629
+ "lstrip": false,
630
+ "normalized": true,
631
+ "rstrip": false,
632
+ "single_word": false,
633
+ "special": false
634
+ },
635
+ "50331": {
636
+ "content": "[unused46]",
637
+ "lstrip": false,
638
+ "normalized": true,
639
+ "rstrip": false,
640
+ "single_word": false,
641
+ "special": false
642
+ },
643
+ "50332": {
644
+ "content": "[unused47]",
645
+ "lstrip": false,
646
+ "normalized": true,
647
+ "rstrip": false,
648
+ "single_word": false,
649
+ "special": false
650
+ },
651
+ "50333": {
652
+ "content": "[unused48]",
653
+ "lstrip": false,
654
+ "normalized": true,
655
+ "rstrip": false,
656
+ "single_word": false,
657
+ "special": false
658
+ },
659
+ "50334": {
660
+ "content": "[unused49]",
661
+ "lstrip": false,
662
+ "normalized": true,
663
+ "rstrip": false,
664
+ "single_word": false,
665
+ "special": false
666
+ },
667
+ "50335": {
668
+ "content": "[unused50]",
669
+ "lstrip": false,
670
+ "normalized": true,
671
+ "rstrip": false,
672
+ "single_word": false,
673
+ "special": false
674
+ },
675
+ "50336": {
676
+ "content": "[unused51]",
677
+ "lstrip": false,
678
+ "normalized": true,
679
+ "rstrip": false,
680
+ "single_word": false,
681
+ "special": false
682
+ },
683
+ "50337": {
684
+ "content": "[unused52]",
685
+ "lstrip": false,
686
+ "normalized": true,
687
+ "rstrip": false,
688
+ "single_word": false,
689
+ "special": false
690
+ },
691
+ "50338": {
692
+ "content": "[unused53]",
693
+ "lstrip": false,
694
+ "normalized": true,
695
+ "rstrip": false,
696
+ "single_word": false,
697
+ "special": false
698
+ },
699
+ "50339": {
700
+ "content": "[unused54]",
701
+ "lstrip": false,
702
+ "normalized": true,
703
+ "rstrip": false,
704
+ "single_word": false,
705
+ "special": false
706
+ },
707
+ "50340": {
708
+ "content": "[unused55]",
709
+ "lstrip": false,
710
+ "normalized": true,
711
+ "rstrip": false,
712
+ "single_word": false,
713
+ "special": false
714
+ },
715
+ "50341": {
716
+ "content": "[unused56]",
717
+ "lstrip": false,
718
+ "normalized": true,
719
+ "rstrip": false,
720
+ "single_word": false,
721
+ "special": false
722
+ },
723
+ "50342": {
724
+ "content": "[unused57]",
725
+ "lstrip": false,
726
+ "normalized": true,
727
+ "rstrip": false,
728
+ "single_word": false,
729
+ "special": false
730
+ },
731
+ "50343": {
732
+ "content": "[unused58]",
733
+ "lstrip": false,
734
+ "normalized": true,
735
+ "rstrip": false,
736
+ "single_word": false,
737
+ "special": false
738
+ },
739
+ "50344": {
740
+ "content": "[unused59]",
741
+ "lstrip": false,
742
+ "normalized": true,
743
+ "rstrip": false,
744
+ "single_word": false,
745
+ "special": false
746
+ },
747
+ "50345": {
748
+ "content": "[unused60]",
749
+ "lstrip": false,
750
+ "normalized": true,
751
+ "rstrip": false,
752
+ "single_word": false,
753
+ "special": false
754
+ },
755
+ "50346": {
756
+ "content": "[unused61]",
757
+ "lstrip": false,
758
+ "normalized": true,
759
+ "rstrip": false,
760
+ "single_word": false,
761
+ "special": false
762
+ },
763
+ "50347": {
764
+ "content": "[unused62]",
765
+ "lstrip": false,
766
+ "normalized": true,
767
+ "rstrip": false,
768
+ "single_word": false,
769
+ "special": false
770
+ },
771
+ "50348": {
772
+ "content": "[unused63]",
773
+ "lstrip": false,
774
+ "normalized": true,
775
+ "rstrip": false,
776
+ "single_word": false,
777
+ "special": false
778
+ },
779
+ "50349": {
780
+ "content": "[unused64]",
781
+ "lstrip": false,
782
+ "normalized": true,
783
+ "rstrip": false,
784
+ "single_word": false,
785
+ "special": false
786
+ },
787
+ "50350": {
788
+ "content": "[unused65]",
789
+ "lstrip": false,
790
+ "normalized": true,
791
+ "rstrip": false,
792
+ "single_word": false,
793
+ "special": false
794
+ },
795
+ "50351": {
796
+ "content": "[unused66]",
797
+ "lstrip": false,
798
+ "normalized": true,
799
+ "rstrip": false,
800
+ "single_word": false,
801
+ "special": false
802
+ },
803
+ "50352": {
804
+ "content": "[unused67]",
805
+ "lstrip": false,
806
+ "normalized": true,
807
+ "rstrip": false,
808
+ "single_word": false,
809
+ "special": false
810
+ },
811
+ "50353": {
812
+ "content": "[unused68]",
813
+ "lstrip": false,
814
+ "normalized": true,
815
+ "rstrip": false,
816
+ "single_word": false,
817
+ "special": false
818
+ },
819
+ "50354": {
820
+ "content": "[unused69]",
821
+ "lstrip": false,
822
+ "normalized": true,
823
+ "rstrip": false,
824
+ "single_word": false,
825
+ "special": false
826
+ },
827
+ "50355": {
828
+ "content": "[unused70]",
829
+ "lstrip": false,
830
+ "normalized": true,
831
+ "rstrip": false,
832
+ "single_word": false,
833
+ "special": false
834
+ },
835
+ "50356": {
836
+ "content": "[unused71]",
837
+ "lstrip": false,
838
+ "normalized": true,
839
+ "rstrip": false,
840
+ "single_word": false,
841
+ "special": false
842
+ },
843
+ "50357": {
844
+ "content": "[unused72]",
845
+ "lstrip": false,
846
+ "normalized": true,
847
+ "rstrip": false,
848
+ "single_word": false,
849
+ "special": false
850
+ },
851
+ "50358": {
852
+ "content": "[unused73]",
853
+ "lstrip": false,
854
+ "normalized": true,
855
+ "rstrip": false,
856
+ "single_word": false,
857
+ "special": false
858
+ },
859
+ "50359": {
860
+ "content": "[unused74]",
861
+ "lstrip": false,
862
+ "normalized": true,
863
+ "rstrip": false,
864
+ "single_word": false,
865
+ "special": false
866
+ },
867
+ "50360": {
868
+ "content": "[unused75]",
869
+ "lstrip": false,
870
+ "normalized": true,
871
+ "rstrip": false,
872
+ "single_word": false,
873
+ "special": false
874
+ },
875
+ "50361": {
876
+ "content": "[unused76]",
877
+ "lstrip": false,
878
+ "normalized": true,
879
+ "rstrip": false,
880
+ "single_word": false,
881
+ "special": false
882
+ },
883
+ "50362": {
884
+ "content": "[unused77]",
885
+ "lstrip": false,
886
+ "normalized": true,
887
+ "rstrip": false,
888
+ "single_word": false,
889
+ "special": false
890
+ },
891
+ "50363": {
892
+ "content": "[unused78]",
893
+ "lstrip": false,
894
+ "normalized": true,
895
+ "rstrip": false,
896
+ "single_word": false,
897
+ "special": false
898
+ },
899
+ "50364": {
900
+ "content": "[unused79]",
901
+ "lstrip": false,
902
+ "normalized": true,
903
+ "rstrip": false,
904
+ "single_word": false,
905
+ "special": false
906
+ },
907
+ "50365": {
908
+ "content": "[unused80]",
909
+ "lstrip": false,
910
+ "normalized": true,
911
+ "rstrip": false,
912
+ "single_word": false,
913
+ "special": false
914
+ },
915
+ "50366": {
916
+ "content": "[unused81]",
917
+ "lstrip": false,
918
+ "normalized": true,
919
+ "rstrip": false,
920
+ "single_word": false,
921
+ "special": false
922
+ },
923
+ "50367": {
924
+ "content": "[unused82]",
925
+ "lstrip": false,
926
+ "normalized": true,
927
+ "rstrip": false,
928
+ "single_word": false,
929
+ "special": false
930
+ }
931
+ },
932
+ "clean_up_tokenization_spaces": true,
933
+ "cls_token": "[CLS]",
934
+ "extra_special_tokens": {},
935
+ "mask_token": "[MASK]",
936
+ "model_input_names": [
937
+ "input_ids",
938
+ "attention_mask"
939
+ ],
940
+ "model_max_length": 8192,
941
+ "pad_token": "[PAD]",
942
+ "sep_token": "[SEP]",
943
+ "tokenizer_class": "PreTrainedTokenizerFast",
944
+ "unk_token": "[UNK]"
945
+ }