jebish7 committed · verified
Commit bdc96e0 · 1 Parent(s): c6bbc48

Add new SentenceTransformer model

README.md ADDED
@@ -0,0 +1,491 @@
1
+ ---
2
+ tags:
3
+ - sentence-transformers
4
+ - sentence-similarity
5
+ - feature-extraction
6
+ - generated_from_trainer
7
+ - dataset_size:29545
8
+ - loss:MultipleNegativesSymmetricRankingLoss
9
+ base_model: jxm/cde-small-v2
10
+ widget:
11
+ - source_sentence: In the context of the risk-based assessment of customers and business
12
+ relationships, how should the overlap between customer risk assessment and CDD
13
+ be managed to ensure both are completed effectively and in compliance with ADGM
14
+ regulations?
15
+ sentences:
16
+ - 'DocumentID: 36 | PassageID: D.7. | Passage: Principle 7 – Scenario analysis of
17
+ climate-related financial risks. Where appropriate, relevant financial firms should
18
+ develop and implement climate-related scenario analysis frameworks, including
19
+ stress testing, in a manner commensurate with their size, complexity, risk profile
20
+ and nature of activities.
21
+
22
+ '
23
+ - 'DocumentID: 1 | PassageID: 7.Guidance.4. | Passage: The risk-based assessment
24
+ of the customer and the proposed business relationship, Transaction or product
25
+ required under this Chapter is required to be undertaken prior to the establishment
26
+ of a business relationship with a customer. Because the risk rating assigned to
27
+ a customer resulting from this assessment determines the level of CDD that must
28
+ be undertaken for that customer, this process must be completed before the CDD
29
+ is completed for the customer. The Regulator is aware that in practice there will
30
+ often be some degree of overlap between the customer risk assessment and CDD.
31
+ For example, a Relevant Person may undertake some aspects of CDD, such as identifying
32
+ Beneficial Owners, when it performs a risk assessment of the customer. Conversely,
33
+ a Relevant Person may also obtain relevant information as part of CDD which has
34
+ an impact on its customer risk assessment. Where information obtained as part
35
+ of CDD of a customer affects the risk rating of a customer, the change in risk
36
+ rating should be reflected in the degree of CDD undertaken.'
37
+ - 'DocumentID: 1 | PassageID: 9.1.2.Guidance.4. | Passage: Where the legislative
38
+ framework of a jurisdiction (such as secrecy or data protection legislation) prevents
39
+ a Relevant Person from having access to CDD information upon request without delay
40
+ as referred to in Rule ‎9.1.1(3)(b), the Relevant Person should undertake the
41
+ relevant CDD itself and should not seek to rely on the relevant third party.'
42
+ - source_sentence: Can you clarify the responsibilities of the Governing Body of a
43
+ Relevant Person in establishing and maintaining AML/TFS policies and procedures,
44
+ and how these should be documented and reviewed?
45
+ sentences:
46
+ - 'DocumentID: 28 | PassageID: 193) | Passage: SUPERVISION BY LISTING AUTHORITY
47
+
48
+ Complaints or allegations of non-compliance by Reporting Entities
49
+
50
+ If, as a result of the enquiry, the Listing Authority forms the view that the
51
+ information is accurate, is Inside Information, and is not within exemption from
52
+ Disclosure provided by Rule 7.2.2, the Listing Authority will ask the Reporting
53
+ Entity to make a Disclosure about the matter under Rule 7.2.1. If the information
54
+ should have been Disclosed earlier, the Listing Authority may issue an ‘aware
55
+ letter’ (see paragraphs 187 to 189 above), or take other relevant action.
56
+
57
+
58
+ '
59
+ - "DocumentID: 17 | PassageID: Part 13.165.(2) | Passage: The Regulator shall not\
60
+ \ approve a Non Abu Dhabi Global Market Clearing House unless it is satisfied—\n\
61
+ (a)\tthat the rules and practices of the body, together with the law of the country\
62
+ \ in which the body's head office is situated, provide adequate procedures for\
63
+ \ dealing with the default of persons party to contracts connected with the body;\
64
+ \ and\n(b)\tthat it is otherwise appropriate to approve the body;\ntogether being\
65
+ \ the “Relevant Requirements” for this Part."
66
+ - "DocumentID: 1 | PassageID: 4.3.1 | Passage: A Relevant Person which is part of\
67
+ \ a Group must ensure that it:\n(a)\thas developed and implemented policies and\
68
+ \ procedures for the sharing of information between Group entities, including\
69
+ \ the sharing of information relating to CDD and money laundering risks;\n(b)\t\
70
+ has in place adequate safeguards on the confidentiality and use of information\
71
+ \ exchanged between Group entities, including consideration of relevant data protection\
72
+ \ legislation;\n(c)\tremains aware of the money laundering risks of the Group\
73
+ \ as a whole and of its exposure to the Group and takes active steps to mitigate\
74
+ \ such risks;\n(d)\tcontributes to a Group-wide risk assessment to identify and\
75
+ \ assess money laundering risks for the Group; and\n(e)\tprovides its Group-wide\
76
+ \ compliance, audit and AML/TFS functions with customer account and Transaction\
77
+ \ information from its Branches and Subsidiaries when necessary for AML/TFS purposes."
78
+ - source_sentence: What specific accounting standards and practices are we required
79
+ to follow when valuing positions in our Trading and Non-Trading Books to ensure
80
+ compliance with ADGM regulations?
81
+ sentences:
82
+ - 'DocumentID: 7 | PassageID: 8.10.1.(2).Guidance.3. | Passage: Each Authorised
83
+ Person, Recognised Body and its Auditors is also required under Part 16 and section
84
+ 193 of the FSMR respectively, to disclose to the Regulator any matter which may
85
+ indicate a breach or likely breach of, or a failure or likely failure to comply
86
+ with, Regulations or Rules. Each Authorised Person and Recognised Body is also
87
+ required to establish and implement systems and procedures to enable its compliance
88
+ and compliance by its Auditors with notification requirements.
89
+
90
+ '
91
+ - "DocumentID: 18 | PassageID: 3.2 | Passage: Financial Services Permissions. VC\
92
+ \ Managers operating in ADGM require a Financial Services Permission (“FSP”) to\
93
+ \ undertake any Regulated Activity pertaining to VC Funds and/or co-investments\
94
+ \ by third parties in VC Funds. The Regulated Activities covered by the FSP will\
95
+ \ be dependent on the VC Managers’ investment strategy and business model.\n(a)\t\
96
+ Managing a Collective Investment Fund: this includes carrying out fund management\
97
+ \ activities in respect of a VC Fund.\n(b)\tAdvising on Investments or Credit\
98
+ \ : for VC Managers these activities will be restricted to activities related\
99
+ \ to co-investment alongside a VC Fund which the VC Manager manages, such as recommending\
100
+ \ that a client invest in an investee company alongside the VC Fund and on the\
101
+ \ strategy and structure required to make the investment.\n(c)\tArranging Deals\
102
+ \ in Investments: VC Managers may also wish to make arrangements to facilitate\
103
+ \ co-investments in the investee company.\nAuthorisation fees and supervision\
104
+ \ fees for a VC Manager are capped at USD 10,000 regardless of whether one or\
105
+ \ both of the additional Regulated Activities in b) and c) above in relation to\
106
+ \ co-investments are included in its FSP. The FSP will include restrictions appropriate\
107
+ \ to the business model of a VC Manager."
108
+ - 'DocumentID: 13 | PassageID: APP2.A2.1.1.(4) | Passage: An Authorised Person must
109
+ value every position included in its Trading Book and the Non Trading Book in
110
+ accordance with the relevant accounting standards and practices.
111
+
112
+ '
113
+ - source_sentence: What documentation and information are we required to maintain
114
+ to demonstrate compliance with the rules pertaining to the cooperation with auditors,
115
+ especially in terms of providing access and not interfering with their duties?
116
+ sentences:
117
+ - "DocumentID: 6 | PassageID: PART 5.16.3.5 | Passage: Co-operation with auditors.\
118
+ \ A Fund Manager must take reasonable steps to ensure that it and its Employees:\n\
119
+ (a)\tprovide any information to its auditor that its auditor reasonably requires,\
120
+ \ or is entitled to receive as auditor;\n(b)\tgive the auditor right of access\
121
+ \ at all reasonable times to relevant records and information within its possession;\n\
122
+ (c)\tallow the auditor to make copies of any records or information referred to\
123
+ \ in ‎(b);\n(d)\tdo not interfere with the auditor's ability to discharge its\
124
+ \ duties;\n(e)\treport to the auditor any matter which may significantly affect\
125
+ \ the financial position of the Fund; and\n(f)\tprovide such other assistance\
126
+ \ as the auditor may reasonably request it to provide."
127
+ - "DocumentID: 13 | PassageID: 4.3.1 | Passage: An Authorised Person must implement\
128
+ \ and maintain comprehensive Credit Risk management systems which:\n(a)\tare appropriate\
129
+ \ to the firm's type, scope, complexity and scale of operations;\n(b)\tare appropriate\
130
+ \ to the diversity of its operations, including geographical diversity;\n(c)\t\
131
+ enable the firm to effectively identify, assess, monitor and control Credit Risk\
132
+ \ and to ensure that adequate Capital Resources are available at all times to\
133
+ \ cover the risks assumed; and\n(d)\tensure effective implementation of the Credit\
134
+ \ Risk strategy and policy."
135
+ - 'DocumentID: 3 | PassageID: 3.8.9 | Passage: The Authorised Person acting as the
136
+ Investment Manager of an ADGM Green Portfolio must provide a copy of the attestation
137
+ obtained for the purposes of Rule ‎3.8.6 to each Client with whom it has entered
138
+ into a Discretionary Portfolio Management Agreement in respect of such ADGM Green
139
+ Portfolio at least on an annual basis and upon request by the Client.'
140
+ - source_sentence: Could you provide examples of circumstances that, when changed,
141
+ would necessitate the reevaluation of a customer's risk assessment and the application
142
+ of updated CDD measures?
143
+ sentences:
144
+ - 'DocumentID: 13 | PassageID: 9.2.1.Guidance.1. | Passage: The Regulator expects
145
+ that an Authorised Person''s Liquidity Risk strategy will set out the approach
146
+ that the Authorised Person will take to Liquidity Risk management, including various
147
+ quantitative and qualitative targets. It should be communicated to all relevant
148
+ functions and staff within the organisation and be set out in the Authorised Person''s
149
+ Liquidity Risk policy.'
150
+ - "DocumentID: 1 | PassageID: 8.1.2.(1) | Passage: A Relevant Person must also apply\
151
+ \ CDD measures to each existing customer under Rules ‎8.3.1, ‎8.4.1 or ‎8.5.1\
152
+ \ as applicable:\n(a)\twith a frequency appropriate to the outcome of the risk-based\
153
+ \ approach taken in relation to each customer; and\n(b)\twhen the Relevant Person\
154
+ \ becomes aware that any circumstances relevant to its risk assessment for a customer\
155
+ \ have changed."
156
+ - "DocumentID: 1 | PassageID: 8.1.1.Guidance.2. | Passage: The FIU has issued guides\
157
+ \ that require:\n(a)\ta DNFBP that is a dealer in precious metals or precious\
158
+ \ stones to obtain relevant identification documents, such as passport, emirates\
159
+ \ ID, trade licence, as applicable, and register the information via goAML for\
160
+ \ all cash transactions equal to or exceeding USD15,000 with individuals and all\
161
+ \ cash or wire transfer transactions equal to or exceeding USD15,000 with entities.\
162
+ \ The Regulator expects a dealer in any saleable item or a price equal to or greater\
163
+ \ than USD15,000 to also comply with this requirement;\n(b)\ta DNFBP that is a\
164
+ \ real estate agent to obtain relevant identification documents, such as passport,\
165
+ \ emirates ID, trade licence, as applicable, and register the information via\
166
+ \ goAML for all sales or purchases of Real Property where:\n(i)\tthe payment for\
167
+ \ the sale/purchase includes a total cash payment of USD15,000 or more whether\
168
+ \ in a single cash payment or multiple cash payments;\n(ii)\tthe payment for any\
169
+ \ part or all of the sale/purchase amount includes payment(s) using Virtual Assets;\n\
170
+ (iii)\tthe payment for any part or all of the sale/purchase amount includes funds\
171
+ \ that were converted from or to a Virtual Asset."
172
+ pipeline_tag: sentence-similarity
173
+ library_name: sentence-transformers
174
+ ---
175
+
176
+ # SentenceTransformer based on jxm/cde-small-v2
177
+
178
+ This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [jxm/cde-small-v2](https://huggingface.co/jxm/cde-small-v2) on the csv dataset. It maps sentences and paragraphs to a 768-dimensional dense vector space (the hidden size of the underlying ModernBERT-base embedder) and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.
179
+
180
+ ## Model Details
181
+
182
+ ### Model Description
183
+ - **Model Type:** Sentence Transformer
184
+ - **Base model:** [jxm/cde-small-v2](https://huggingface.co/jxm/cde-small-v2) <!-- at revision 287bf0ea6ebfecf2339762d0ef28fb846959a8f2 -->
185
+ - **Maximum Sequence Length:** 512 tokens (per `max_seq_length` in the model config)
186
+ - **Output Dimensionality:** 768 dimensions (inherited from the ModernBERT-base embedder; `embedding_output_dim` is unset in the config)
187
+ - **Similarity Function:** Cosine Similarity
188
+ - **Training Dataset:**
189
+ - csv
190
+ <!-- - **Language:** Unknown -->
191
+ <!-- - **License:** Unknown -->
192
+
193
+ ### Model Sources
194
+
195
+ - **Documentation:** [Sentence Transformers Documentation](https://sbert.net)
196
+ - **Repository:** [Sentence Transformers on GitHub](https://github.com/UKPLab/sentence-transformers)
197
+ - **Hugging Face:** [Sentence Transformers on Hugging Face](https://huggingface.co/models?library=sentence-transformers)
198
+
199
+ ### Full Model Architecture
200
+
201
+ ```
202
+ SentenceTransformer(
203
+ (0): Transformer({}) with Transformer model: ContextualDocumentEmbeddingTransformer
204
+ )
205
+ ```
206
+
207
+ ## Usage
208
+
209
+ ### Direct Usage (Sentence Transformers)
210
+
211
+ First install the Sentence Transformers library:
212
+
213
+ ```bash
214
+ pip install -U sentence-transformers
215
+ ```
216
+
217
+ Then you can load this model and run inference.
218
+ ```python
219
+ from sentence_transformers import SentenceTransformer
220
+
221
+ # Download from the 🤗 Hub
222
+ model = SentenceTransformer("jebish7/cde-v2-obliqa-1", trust_remote_code=True)  # required by the bundled custom Transformer module
223
+ # Run inference
224
+ sentences = [
225
+ "Could you provide examples of circumstances that, when changed, would necessitate the reevaluation of a customer's risk assessment and the application of updated CDD measures?",
226
+ 'DocumentID: 1 | PassageID: 8.1.2.(1) | Passage: A Relevant Person must also apply CDD measures to each existing customer under Rules \u200e8.3.1, \u200e8.4.1 or \u200e8.5.1 as applicable:\n(a)\twith a frequency appropriate to the outcome of the risk-based approach taken in relation to each customer; and\n(b)\twhen the Relevant Person becomes aware that any circumstances relevant to its risk assessment for a customer have changed.',
227
+ "DocumentID: 13 | PassageID: 9.2.1.Guidance.1. | Passage: The Regulator expects that an Authorised Person's Liquidity Risk strategy will set out the approach that the Authorised Person will take to Liquidity Risk management, including various quantitative and qualitative targets. It should be communicated to all relevant functions and staff within the organisation and be set out in the Authorised Person's Liquidity Risk policy.",
228
+ ]
229
+ embeddings = model.encode(sentences)
230
+ print(embeddings.shape)
231
+ # [3, 768]
232
+
233
+ # Get the similarity scores for the embeddings
234
+ similarities = model.similarity(embeddings, embeddings)
235
+ print(similarities.shape)
236
+ # [3, 3]
237
+ ```
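+ 
+ This repository's `config_sentence_transformers.json` registers a `query` prompt (`"search_query: "`) and a `document` prompt (`"search_document: "`). The sketch below shows one possible retrieval-style (asymmetric) usage that selects these prompts via `prompt_name`; the query and passage strings are placeholders.
+ 
+ ```python
+ from sentence_transformers import SentenceTransformer
+ 
+ model = SentenceTransformer("jebish7/cde-v2-obliqa-1", trust_remote_code=True)
+ 
+ queries = ["What CDD measures apply to existing customers?"]
+ passages = [
+     "DocumentID: 1 | PassageID: 8.1.2.(1) | Passage: A Relevant Person must also apply CDD measures to each existing customer ...",
+ ]
+ 
+ # Prepend the configured prompts by name instead of concatenating them manually
+ query_embeddings = model.encode(queries, prompt_name="query")
+ passage_embeddings = model.encode(passages, prompt_name="document")
+ 
+ # Cosine similarity, as configured in `similarity_fn_name`
+ scores = model.similarity(query_embeddings, passage_embeddings)
+ print(scores)
+ ```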
238
+
239
+ <!--
240
+ ### Direct Usage (Transformers)
241
+
242
+ <details><summary>Click to see the direct usage in Transformers</summary>
243
+
244
+ </details>
245
+ -->
246
+
247
+ <!--
248
+ ### Downstream Usage (Sentence Transformers)
249
+
250
+ You can finetune this model on your own dataset.
251
+
252
+ <details><summary>Click to expand</summary>
253
+
254
+ </details>
255
+ -->
256
+
257
+ <!--
258
+ ### Out-of-Scope Use
259
+
260
+ *List how the model may foreseeably be misused and address what users ought not to do with the model.*
261
+ -->
262
+
263
+ <!--
264
+ ## Bias, Risks and Limitations
265
+
266
+ *What are the known or foreseeable issues stemming from this model? You could also flag here known failure cases or weaknesses of the model.*
267
+ -->
268
+
269
+ <!--
270
+ ### Recommendations
271
+
272
+ *What are recommendations with respect to the foreseeable issues? For example, filtering explicit content.*
273
+ -->
274
+
275
+ ## Training Details
276
+
277
+ ### Training Dataset
278
+
279
+ #### csv
280
+
281
+ * Dataset: csv
282
+ * Size: 29,545 training samples
283
+ * Columns: <code>anchor</code> and <code>positive</code>
284
+ * Approximate statistics based on the first 1000 samples:
285
+ | | anchor | positive |
286
+ |:--------|:-----------------------------------------------------------------------------------|:-------------------------------------------------------------------------------------|
287
+ | type | string | string |
288
+ | details | <ul><li>min: 17 tokens</li><li>mean: 35.21 tokens</li><li>max: 66 tokens</li></ul> | <ul><li>min: 29 tokens</li><li>mean: 143.53 tokens</li><li>max: 512 tokens</li></ul> |
289
+ * Samples:
290
+ | anchor | positive |
291
+ |:------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|:---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
292
+ | <code>Could you outline the expected procedures for a Trade Repository to notify relevant authorities of any significant errors or omissions in previously submitted data?</code> | <code>DocumentID: 7 | PassageID: APP2.A2.1.2 | Passage: Processes and procedures. A Trade Repository must have effective processes and procedures to provide data to relevant authorities in a timely and appropriate manner to enable them to meet their respective regulatory mandates and legal responsibilities.</code> |
293
+ | <code>In the context of a non-binding MPO, how are commodities held by an Authorised Person treated for the purpose of determining the Commodities Risk Capital Requirement?</code> | <code>DocumentID: 9 | PassageID: 5.4.13.(a) | Passage: Commodities held by an Authorised Person for selling or leasing when executing a Murabaha, non-binding MPO, Salam or parallel Salam contract must be included in the calculation of its Commodities Risk Capital Requirement.</code> |
294
+ | <code>Can the FSRA provide case studies or examples of best practices for RIEs operating MTFs or OTFs using spot commodities in line with the Spot Commodities Framework?</code> | <code>DocumentID: 34 | PassageID: 77) | Passage: REGULATORY REQUIREMENTS - SPOT COMMODITY ACTIVITIES<br>RIEs operating an MTF or OTF using Accepted Spot Commodities<br>This means that an RIE (in addition to operating markets relating to the trading of Financial Instruments) can, where permitted by the FSRA and subject to MIR Rule 3.4.2, operate a separate MTF or OTF under its Recognition Order. This MTF or OTF may operate using Accepted Spot Commodities.<br></code> |
295
+ * Loss: [<code>MultipleNegativesSymmetricRankingLoss</code>](https://sbert.net/docs/package_reference/sentence_transformer/losses.html#multiplenegativessymmetricrankingloss) with these parameters:
296
+ ```json
297
+ {
298
+ "scale": 20.0,
299
+ "similarity_fct": "cos_sim"
300
+ }
301
+ ```
302
+
303
+ ### Training Hyperparameters
304
+ #### Non-Default Hyperparameters
305
+
306
+ - `per_device_train_batch_size`: 12
307
+ - `num_train_epochs`: 1
308
+ - `warmup_ratio`: 0.1
309
+ - `batch_sampler`: no_duplicates
310
+
311
+ #### All Hyperparameters
312
+ <details><summary>Click to expand</summary>
313
+
314
+ - `overwrite_output_dir`: False
315
+ - `do_predict`: False
316
+ - `eval_strategy`: no
317
+ - `prediction_loss_only`: True
318
+ - `per_device_train_batch_size`: 12
319
+ - `per_device_eval_batch_size`: 8
320
+ - `per_gpu_train_batch_size`: None
321
+ - `per_gpu_eval_batch_size`: None
322
+ - `gradient_accumulation_steps`: 1
323
+ - `eval_accumulation_steps`: None
324
+ - `torch_empty_cache_steps`: None
325
+ - `learning_rate`: 5e-05
326
+ - `weight_decay`: 0.0
327
+ - `adam_beta1`: 0.9
328
+ - `adam_beta2`: 0.999
329
+ - `adam_epsilon`: 1e-08
330
+ - `max_grad_norm`: 1.0
331
+ - `num_train_epochs`: 1
332
+ - `max_steps`: -1
333
+ - `lr_scheduler_type`: linear
334
+ - `lr_scheduler_kwargs`: {}
335
+ - `warmup_ratio`: 0.1
336
+ - `warmup_steps`: 0
337
+ - `log_level`: passive
338
+ - `log_level_replica`: warning
339
+ - `log_on_each_node`: True
340
+ - `logging_nan_inf_filter`: True
341
+ - `save_safetensors`: True
342
+ - `save_on_each_node`: False
343
+ - `save_only_model`: False
344
+ - `restore_callback_states_from_checkpoint`: False
345
+ - `no_cuda`: False
346
+ - `use_cpu`: False
347
+ - `use_mps_device`: False
348
+ - `seed`: 42
349
+ - `data_seed`: None
350
+ - `jit_mode_eval`: False
351
+ - `use_ipex`: False
352
+ - `bf16`: False
353
+ - `fp16`: False
354
+ - `fp16_opt_level`: O1
355
+ - `half_precision_backend`: auto
356
+ - `bf16_full_eval`: False
357
+ - `fp16_full_eval`: False
358
+ - `tf32`: None
359
+ - `local_rank`: 0
360
+ - `ddp_backend`: None
361
+ - `tpu_num_cores`: None
362
+ - `tpu_metrics_debug`: False
363
+ - `debug`: []
364
+ - `dataloader_drop_last`: False
365
+ - `dataloader_num_workers`: 0
366
+ - `dataloader_prefetch_factor`: None
367
+ - `past_index`: -1
368
+ - `disable_tqdm`: False
369
+ - `remove_unused_columns`: True
370
+ - `label_names`: None
371
+ - `load_best_model_at_end`: False
372
+ - `ignore_data_skip`: False
373
+ - `fsdp`: []
374
+ - `fsdp_min_num_params`: 0
375
+ - `fsdp_config`: {'min_num_params': 0, 'xla': False, 'xla_fsdp_v2': False, 'xla_fsdp_grad_ckpt': False}
376
+ - `fsdp_transformer_layer_cls_to_wrap`: None
377
+ - `accelerator_config`: {'split_batches': False, 'dispatch_batches': None, 'even_batches': True, 'use_seedable_sampler': True, 'non_blocking': False, 'gradient_accumulation_kwargs': None}
378
+ - `deepspeed`: None
379
+ - `label_smoothing_factor`: 0.0
380
+ - `optim`: adamw_torch
381
+ - `optim_args`: None
382
+ - `adafactor`: False
383
+ - `group_by_length`: False
384
+ - `length_column_name`: length
385
+ - `ddp_find_unused_parameters`: None
386
+ - `ddp_bucket_cap_mb`: None
387
+ - `ddp_broadcast_buffers`: False
388
+ - `dataloader_pin_memory`: True
389
+ - `dataloader_persistent_workers`: False
390
+ - `skip_memory_metrics`: True
391
+ - `use_legacy_prediction_loop`: False
392
+ - `push_to_hub`: False
393
+ - `resume_from_checkpoint`: None
394
+ - `hub_model_id`: None
395
+ - `hub_strategy`: every_save
396
+ - `hub_private_repo`: None
397
+ - `hub_always_push`: False
398
+ - `gradient_checkpointing`: False
399
+ - `gradient_checkpointing_kwargs`: None
400
+ - `include_inputs_for_metrics`: False
401
+ - `include_for_metrics`: []
402
+ - `eval_do_concat_batches`: True
403
+ - `fp16_backend`: auto
404
+ - `push_to_hub_model_id`: None
405
+ - `push_to_hub_organization`: None
406
+ - `mp_parameters`:
407
+ - `auto_find_batch_size`: False
408
+ - `full_determinism`: False
409
+ - `torchdynamo`: None
410
+ - `ray_scope`: last
411
+ - `ddp_timeout`: 1800
412
+ - `torch_compile`: False
413
+ - `torch_compile_backend`: None
414
+ - `torch_compile_mode`: None
415
+ - `dispatch_batches`: None
416
+ - `split_batches`: None
417
+ - `include_tokens_per_second`: False
418
+ - `include_num_input_tokens_seen`: False
419
+ - `neftune_noise_alpha`: None
420
+ - `optim_target_modules`: None
421
+ - `batch_eval_metrics`: False
422
+ - `eval_on_start`: False
423
+ - `use_liger_kernel`: False
424
+ - `eval_use_gather_object`: False
425
+ - `average_tokens_across_devices`: False
426
+ - `prompts`: None
427
+ - `batch_sampler`: no_duplicates
428
+ - `multi_dataset_batch_sampler`: proportional
429
+
430
+ </details>
431
+
432
+ ### Training Logs
433
+ | Epoch | Step | Training Loss |
434
+ |:------:|:----:|:-------------:|
435
+ | 0.0812 | 100 | 1.7126 |
436
+ | 0.1623 | 200 | 0.7412 |
437
+ | 0.2435 | 300 | 0.6673 |
438
+ | 0.3247 | 400 | 0.6119 |
439
+ | 0.4058 | 500 | 0.5413 |
440
+ | 0.4870 | 600 | 0.5807 |
441
+ | 0.5682 | 700 | 0.506 |
442
+ | 0.6494 | 800 | 0.5132 |
443
+ | 0.7305 | 900 | 0.4641 |
444
+ | 0.8117 | 1000 | 0.456 |
445
+ | 0.8929 | 1100 | 0.4954 |
446
+ | 0.9740 | 1200 | 0.4088 |
447
+
448
+
449
+ ### Framework Versions
450
+ - Python: 3.10.12
451
+ - Sentence Transformers: 3.3.1
452
+ - Transformers: 4.48.3
453
+ - PyTorch: 2.5.1+cu121
454
+ - Accelerate: 1.2.1
455
+ - Datasets: 3.3.2
456
+ - Tokenizers: 0.21.0
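+ 
+ To approximate this environment, the versions above can be pinned when installing (optional; the PyTorch build used here was the cu121 wheel):
+ 
+ ```bash
+ pip install "sentence-transformers==3.3.1" "transformers==4.48.3" \
+     "accelerate==1.2.1" "datasets==3.3.2" "tokenizers==0.21.0" "torch==2.5.1"
+ ```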
457
+
458
+ ## Citation
459
+
460
+ ### BibTeX
461
+
462
+ #### Sentence Transformers
463
+ ```bibtex
464
+ @inproceedings{reimers-2019-sentence-bert,
465
+ title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
466
+ author = "Reimers, Nils and Gurevych, Iryna",
467
+ booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
468
+ month = "11",
469
+ year = "2019",
470
+ publisher = "Association for Computational Linguistics",
471
+ url = "https://arxiv.org/abs/1908.10084",
472
+ }
473
+ ```
474
+
475
+ <!--
476
+ ## Glossary
477
+
478
+ *Clearly define terms in order to be accessible across audiences.*
479
+ -->
480
+
481
+ <!--
482
+ ## Model Card Authors
483
+
484
+ *Lists the people who create the model card, providing recognition and accountability for the detailed work that goes into its construction.*
485
+ -->
486
+
487
+ <!--
488
+ ## Model Card Contact
489
+
490
+ *Provides a way for people who have updates to the Model Card, suggestions, or questions, to contact the Model Card authors.*
491
+ -->
config.json ADDED
@@ -0,0 +1,36 @@
1
+ {
2
+ "_name_or_path": "jxm/cde-small-v2",
3
+ "architecture": "transductive",
4
+ "architectures": [
5
+ "ContextualDocumentEmbeddingTransformer"
6
+ ],
7
+ "attn_implementation": null,
8
+ "auto_map": {
9
+ "AutoConfig": "jxm/cde-small-v2--model.ContextualModelConfig",
10
+ "AutoModel": "jxm/cde-small-v2--model.ContextualDocumentEmbeddingTransformer"
11
+ },
12
+ "autoregressive_backbone": false,
13
+ "cache_dir": null,
14
+ "config_name": null,
15
+ "dataset_backbone": null,
16
+ "disable_dropout": true,
17
+ "disable_transductive_rotary_embedding": true,
18
+ "embedder": "answerdotai/ModernBERT-base",
19
+ "embedder_rerank": "sentence-transformers/gtr-t5-base",
20
+ "embedding_output_dim": null,
21
+ "limit_layers": null,
22
+ "limit_layers_first_stage": null,
23
+ "logit_scale": 50.0,
24
+ "max_seq_length": 512,
25
+ "model_revision": "main",
26
+ "pool_ignore_contextual_tokens": true,
27
+ "pool_ignore_instruction_tokens": true,
28
+ "pooling_strategy": "mean",
29
+ "tokenizer_name": null,
30
+ "torch_dtype": "float32",
31
+ "transductive_corpus_size": 512,
32
+ "transductive_sequence_dropout_prob": 0.0,
33
+ "transductive_tie_token_embeddings": false,
34
+ "transductive_tokens_per_document": 1,
35
+ "transformers_version": "4.48.3"
36
+ }
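The `auto_map` entries above point to the custom `ContextualDocumentEmbeddingTransformer` class published with `jxm/cde-small-v2`, so the raw backbone can also be loaded with plain `transformers`. A minimal sketch (most users should prefer the `SentenceTransformer` wrapper from the README, which also handles tokenization, prompts, and pooling):

```python
from transformers import AutoModel

# trust_remote_code is required because the model class is resolved from the jxm/cde-small-v2 repo
model = AutoModel.from_pretrained("jebish7/cde-v2-obliqa-1", trust_remote_code=True)
print(type(model).__name__)  # ContextualDocumentEmbeddingTransformer
```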
config_sentence_transformers.json ADDED
@@ -0,0 +1,13 @@
1
+ {
2
+ "__version__": {
3
+ "sentence_transformers": "3.3.1",
4
+ "transformers": "4.48.3",
5
+ "pytorch": "2.5.1+cu121"
6
+ },
7
+ "prompts": {
8
+ "query": "search_query: ",
9
+ "document": "search_document: "
10
+ },
11
+ "default_prompt_name": null,
12
+ "similarity_fn_name": "cosine"
13
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
1
+ version https://git-lfs.github.com/spec/v1
2
+ oid sha256:85f9f1e13c491bd15bef0b2a15f71af13d62617ba86a3ccea7acef3cb50c1489
3
+ size 1222859872
modules.json ADDED
@@ -0,0 +1,8 @@
1
+ [
2
+ {
3
+ "idx": 0,
4
+ "name": "0",
5
+ "path": "",
6
+ "type": "sentence_transformers_impl.Transformer"
7
+ }
8
+ ]
sentence_bert_config.json ADDED
@@ -0,0 +1 @@
1
+ {}
sentence_transformers_impl.py ADDED
@@ -0,0 +1,155 @@
1
+ from __future__ import annotations
2
+
3
+ import json
4
+ import logging
5
+ import os
6
+ from typing import Any, Optional
7
+
8
+ import torch
9
+ from torch import nn
10
+ from transformers import AutoConfig, AutoModel, AutoTokenizer
11
+
12
+ logger = logging.getLogger(__name__)
13
+
14
+
15
+ class Transformer(nn.Module):
16
+ """Hugging Face AutoModel to generate token embeddings.
17
+ Loads the correct class, e.g. BERT / RoBERTa etc.
18
+ Args:
19
+ model_name_or_path: Hugging Face models name
20
+ (https://huggingface.co/models)
21
+ max_seq_length: Truncate any inputs longer than max_seq_length
22
+ model_args: Keyword arguments passed to the Hugging Face
23
+ Transformers model
24
+ tokenizer_args: Keyword arguments passed to the Hugging Face
25
+ Transformers tokenizer
26
+ config_args: Keyword arguments passed to the Hugging Face
27
+ Transformers config
28
+ cache_dir: Cache dir for Hugging Face Transformers to store/load
29
+ models
30
+ do_lower_case: If true, lowercases the input (independent if the
31
+ model is cased or not)
32
+ tokenizer_name_or_path: Name or path of the tokenizer. When
33
+ None, then model_name_or_path is used
34
+ backend: Backend used for model inference. Can be `torch`, `onnx`,
35
+ or `openvino`. Default is `torch`.
36
+ """
37
+
38
+ save_in_root: bool = True
39
+
40
+ def __init__(
41
+ self,
42
+ model_name_or_path: str,
43
+ model_args: dict[str, Any] | None = None,
44
+ tokenizer_args: dict[str, Any] | None = None,
45
+ config_args: dict[str, Any] | None = None,
46
+ cache_dir: str | None = None,
47
+ **kwargs,
48
+ ) -> None:
49
+ super().__init__()
50
+ if model_args is None:
51
+ model_args = {}
52
+ if tokenizer_args is None:
53
+ tokenizer_args = {}
54
+ if config_args is None:
55
+ config_args = {}
56
+
57
+ if not model_args.get("trust_remote_code", False):
58
+ raise ValueError(
59
+ "You need to set `trust_remote_code=True` to load this model."
60
+ )
61
+
62
+ self.config = AutoConfig.from_pretrained(model_name_or_path, **config_args, cache_dir=cache_dir)
63
+ self.auto_model = AutoModel.from_pretrained(model_name_or_path, config=self.config, cache_dir=cache_dir, **model_args)
64
+
65
+ self.tokenizer = AutoTokenizer.from_pretrained(
66
+ "answerdotai/ModernBERT-base",
67
+ cache_dir=cache_dir,
68
+ **tokenizer_args,
69
+ )
70
+
71
+ def __repr__(self) -> str:
72
+ return f"Transformer({self.get_config_dict()}) with Transformer model: {self.auto_model.__class__.__name__} "
73
+
74
+ def forward(self, features: dict[str, torch.Tensor], dataset_embeddings: Optional[torch.Tensor] = None, **kwargs) -> dict[str, torch.Tensor]:
75
+ """Returns token_embeddings, cls_token"""
76
+ # If we don't have embeddings, then run the 1st stage model.
77
+ # If we do, then run the 2nd stage model.
78
+ if dataset_embeddings is None:
79
+ sentence_embedding = self.auto_model.first_stage_model(
80
+ input_ids=features["input_ids"],
81
+ attention_mask=features["attention_mask"],
82
+ )
83
+ else:
84
+ sentence_embedding = self.auto_model.second_stage_model(
85
+ input_ids=features["input_ids"],
86
+ attention_mask=features["attention_mask"],
87
+ dataset_embeddings=dataset_embeddings,
88
+ )
89
+
90
+ features["sentence_embedding"] = sentence_embedding
91
+ return features
92
+
93
+ def get_word_embedding_dimension(self) -> int:
94
+ return self.auto_model.config.hidden_size
95
+
96
+ def tokenize(
97
+ self, texts: list[str] | list[dict] | list[tuple[str, str]], padding: str | bool = True
98
+ ) -> dict[str, torch.Tensor]:
99
+ """Tokenizes a text and maps tokens to token-ids"""
100
+ output = {}
101
+ if isinstance(texts[0], str):
102
+ to_tokenize = [texts]
103
+ elif isinstance(texts[0], dict):
104
+ to_tokenize = []
105
+ output["text_keys"] = []
106
+ for lookup in texts:
107
+ text_key, text = next(iter(lookup.items()))
108
+ to_tokenize.append(text)
109
+ output["text_keys"].append(text_key)
110
+ to_tokenize = [to_tokenize]
111
+ else:
112
+ batch1, batch2 = [], []
113
+ for text_tuple in texts:
114
+ batch1.append(text_tuple[0])
115
+ batch2.append(text_tuple[1])
116
+ to_tokenize = [batch1, batch2]
117
+
118
+ max_seq_length = self.config.max_seq_length
119
+ output.update(
120
+ self.tokenizer(
121
+ *to_tokenize,
122
+ padding=padding,
123
+ truncation="longest_first",
124
+ return_tensors="pt",
125
+ max_length=max_seq_length,
126
+ )
127
+ )
128
+ return output
129
+
130
+ def get_config_dict(self) -> dict[str, Any]:
131
+ return {}
132
+
133
+ def save(self, output_path: str, safe_serialization: bool = True) -> None:
134
+ self.auto_model.save_pretrained(output_path, safe_serialization=safe_serialization)
135
+ self.tokenizer.save_pretrained(output_path)
136
+
137
+ with open(os.path.join(output_path, "sentence_bert_config.json"), "w") as fOut:
138
+ json.dump(self.get_config_dict(), fOut, indent=2)
139
+
140
+ @classmethod
141
+ def load(cls, input_path: str) -> Transformer:
142
+ sbert_config_path = os.path.join(input_path, "sentence_bert_config.json")
143
+ if not os.path.exists(sbert_config_path):
144
+ return cls(model_name_or_path=input_path)
145
+
146
+ with open(sbert_config_path) as fIn:
147
+ config = json.load(fIn)
148
+ # Don't allow configs to set trust_remote_code
149
+ if "model_args" in config and "trust_remote_code" in config["model_args"]:
150
+ config["model_args"].pop("trust_remote_code")
151
+ if "tokenizer_args" in config and "trust_remote_code" in config["tokenizer_args"]:
152
+ config["tokenizer_args"].pop("trust_remote_code")
153
+ if "config_args" in config and "trust_remote_code" in config["config_args"]:
154
+ config["config_args"].pop("trust_remote_code")
155
+ return cls(model_name_or_path=input_path, **config)
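The `forward` method above runs the first-stage model when no `dataset_embeddings` are passed and the second-stage (contextual) model when they are. The snippet below is a low-level sketch that simply mirrors that logic by calling the module directly; the mini-corpus texts are placeholders, and a realistic mini-corpus would normally be sized to the config's `transductive_corpus_size` of 512.

```python
import torch
from sentence_transformers import SentenceTransformer

st_model = SentenceTransformer("jebish7/cde-v2-obliqa-1", trust_remote_code=True)
module = st_model[0]  # the custom Transformer module defined above

# Stage 1: embed a (placeholder) mini-corpus with the first-stage model
corpus_features = module.tokenize([
    "search_document: placeholder passage one",
    "search_document: placeholder passage two",
])
with torch.no_grad():
    dataset_embeddings = module.auto_model.first_stage_model(
        input_ids=corpus_features["input_ids"],
        attention_mask=corpus_features["attention_mask"],
    )

# Stage 2: embed a query conditioned on those corpus embeddings
query_features = module.tokenize(["search_query: What CDD measures apply to existing customers?"])
with torch.no_grad():
    out = module(query_features, dataset_embeddings=dataset_embeddings)
print(out["sentence_embedding"].shape)
```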
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
1
+ {
2
+ "cls_token": {
3
+ "content": "[CLS]",
4
+ "lstrip": false,
5
+ "normalized": false,
6
+ "rstrip": false,
7
+ "single_word": false
8
+ },
9
+ "mask_token": {
10
+ "content": "[MASK]",
11
+ "lstrip": true,
12
+ "normalized": false,
13
+ "rstrip": false,
14
+ "single_word": false
15
+ },
16
+ "pad_token": {
17
+ "content": "[PAD]",
18
+ "lstrip": false,
19
+ "normalized": false,
20
+ "rstrip": false,
21
+ "single_word": false
22
+ },
23
+ "sep_token": {
24
+ "content": "[SEP]",
25
+ "lstrip": false,
26
+ "normalized": false,
27
+ "rstrip": false,
28
+ "single_word": false
29
+ },
30
+ "unk_token": {
31
+ "content": "[UNK]",
32
+ "lstrip": false,
33
+ "normalized": false,
34
+ "rstrip": false,
35
+ "single_word": false
36
+ }
37
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,945 @@
1
+ {
2
+ "added_tokens_decoder": {
3
+ "0": {
4
+ "content": "|||IP_ADDRESS|||",
5
+ "lstrip": false,
6
+ "normalized": true,
7
+ "rstrip": false,
8
+ "single_word": false,
9
+ "special": false
10
+ },
11
+ "1": {
12
+ "content": "<|padding|>",
13
+ "lstrip": false,
14
+ "normalized": false,
15
+ "rstrip": false,
16
+ "single_word": false,
17
+ "special": true
18
+ },
19
+ "50254": {
20
+ "content": " ",
21
+ "lstrip": false,
22
+ "normalized": true,
23
+ "rstrip": false,
24
+ "single_word": false,
25
+ "special": false
26
+ },
27
+ "50255": {
28
+ "content": " ",
29
+ "lstrip": false,
30
+ "normalized": true,
31
+ "rstrip": false,
32
+ "single_word": false,
33
+ "special": false
34
+ },
35
+ "50256": {
36
+ "content": " ",
37
+ "lstrip": false,
38
+ "normalized": true,
39
+ "rstrip": false,
40
+ "single_word": false,
41
+ "special": false
42
+ },
43
+ "50257": {
44
+ "content": " ",
45
+ "lstrip": false,
46
+ "normalized": true,
47
+ "rstrip": false,
48
+ "single_word": false,
49
+ "special": false
50
+ },
51
+ "50258": {
52
+ "content": " ",
53
+ "lstrip": false,
54
+ "normalized": true,
55
+ "rstrip": false,
56
+ "single_word": false,
57
+ "special": false
58
+ },
59
+ "50259": {
60
+ "content": " ",
61
+ "lstrip": false,
62
+ "normalized": true,
63
+ "rstrip": false,
64
+ "single_word": false,
65
+ "special": false
66
+ },
67
+ "50260": {
68
+ "content": " ",
69
+ "lstrip": false,
70
+ "normalized": true,
71
+ "rstrip": false,
72
+ "single_word": false,
73
+ "special": false
74
+ },
75
+ "50261": {
76
+ "content": " ",
77
+ "lstrip": false,
78
+ "normalized": true,
79
+ "rstrip": false,
80
+ "single_word": false,
81
+ "special": false
82
+ },
83
+ "50262": {
84
+ "content": " ",
85
+ "lstrip": false,
86
+ "normalized": true,
87
+ "rstrip": false,
88
+ "single_word": false,
89
+ "special": false
90
+ },
91
+ "50263": {
92
+ "content": " ",
93
+ "lstrip": false,
94
+ "normalized": true,
95
+ "rstrip": false,
96
+ "single_word": false,
97
+ "special": false
98
+ },
99
+ "50264": {
100
+ "content": " ",
101
+ "lstrip": false,
102
+ "normalized": true,
103
+ "rstrip": false,
104
+ "single_word": false,
105
+ "special": false
106
+ },
107
+ "50265": {
108
+ "content": " ",
109
+ "lstrip": false,
110
+ "normalized": true,
111
+ "rstrip": false,
112
+ "single_word": false,
113
+ "special": false
114
+ },
115
+ "50266": {
116
+ "content": " ",
117
+ "lstrip": false,
118
+ "normalized": true,
119
+ "rstrip": false,
120
+ "single_word": false,
121
+ "special": false
122
+ },
123
+ "50267": {
124
+ "content": " ",
125
+ "lstrip": false,
126
+ "normalized": true,
127
+ "rstrip": false,
128
+ "single_word": false,
129
+ "special": false
130
+ },
131
+ "50268": {
132
+ "content": " ",
133
+ "lstrip": false,
134
+ "normalized": true,
135
+ "rstrip": false,
136
+ "single_word": false,
137
+ "special": false
138
+ },
139
+ "50269": {
140
+ "content": " ",
141
+ "lstrip": false,
142
+ "normalized": true,
143
+ "rstrip": false,
144
+ "single_word": false,
145
+ "special": false
146
+ },
147
+ "50270": {
148
+ "content": " ",
149
+ "lstrip": false,
150
+ "normalized": true,
151
+ "rstrip": false,
152
+ "single_word": false,
153
+ "special": false
154
+ },
155
+ "50271": {
156
+ "content": " ",
157
+ "lstrip": false,
158
+ "normalized": true,
159
+ "rstrip": false,
160
+ "single_word": false,
161
+ "special": false
162
+ },
163
+ "50272": {
164
+ "content": " ",
165
+ "lstrip": false,
166
+ "normalized": true,
167
+ "rstrip": false,
168
+ "single_word": false,
169
+ "special": false
170
+ },
171
+ "50273": {
172
+ "content": " ",
173
+ "lstrip": false,
174
+ "normalized": true,
175
+ "rstrip": false,
176
+ "single_word": false,
177
+ "special": false
178
+ },
179
+ "50274": {
180
+ "content": " ",
181
+ "lstrip": false,
182
+ "normalized": true,
183
+ "rstrip": false,
184
+ "single_word": false,
185
+ "special": false
186
+ },
187
+ "50275": {
188
+ "content": " ",
189
+ "lstrip": false,
190
+ "normalized": true,
191
+ "rstrip": false,
192
+ "single_word": false,
193
+ "special": false
194
+ },
195
+ "50276": {
196
+ "content": " ",
197
+ "lstrip": false,
198
+ "normalized": true,
199
+ "rstrip": false,
200
+ "single_word": false,
201
+ "special": false
202
+ },
203
+ "50277": {
204
+ "content": "|||EMAIL_ADDRESS|||",
205
+ "lstrip": false,
206
+ "normalized": true,
207
+ "rstrip": false,
208
+ "single_word": false,
209
+ "special": false
210
+ },
211
+ "50278": {
212
+ "content": "|||PHONE_NUMBER|||",
213
+ "lstrip": false,
214
+ "normalized": true,
215
+ "rstrip": false,
216
+ "single_word": false,
217
+ "special": false
218
+ },
219
+ "50279": {
220
+ "content": "<|endoftext|>",
221
+ "lstrip": false,
222
+ "normalized": false,
223
+ "rstrip": false,
224
+ "single_word": false,
225
+ "special": true
226
+ },
227
+ "50280": {
228
+ "content": "[UNK]",
229
+ "lstrip": false,
230
+ "normalized": false,
231
+ "rstrip": false,
232
+ "single_word": false,
233
+ "special": true
234
+ },
235
+ "50281": {
236
+ "content": "[CLS]",
237
+ "lstrip": false,
238
+ "normalized": false,
239
+ "rstrip": false,
240
+ "single_word": false,
241
+ "special": true
242
+ },
243
+ "50282": {
244
+ "content": "[SEP]",
245
+ "lstrip": false,
246
+ "normalized": false,
247
+ "rstrip": false,
248
+ "single_word": false,
249
+ "special": true
250
+ },
251
+ "50283": {
252
+ "content": "[PAD]",
253
+ "lstrip": false,
254
+ "normalized": false,
255
+ "rstrip": false,
256
+ "single_word": false,
257
+ "special": true
258
+ },
259
+ "50284": {
260
+ "content": "[MASK]",
261
+ "lstrip": true,
262
+ "normalized": false,
263
+ "rstrip": false,
264
+ "single_word": false,
265
+ "special": true
266
+ },
267
+ "50285": {
268
+ "content": "[unused0]",
269
+ "lstrip": false,
270
+ "normalized": true,
271
+ "rstrip": false,
272
+ "single_word": false,
273
+ "special": false
274
+ },
275
+ "50286": {
276
+ "content": "[unused1]",
277
+ "lstrip": false,
278
+ "normalized": true,
279
+ "rstrip": false,
280
+ "single_word": false,
281
+ "special": false
282
+ },
283
+ "50287": {
284
+ "content": "[unused2]",
285
+ "lstrip": false,
286
+ "normalized": true,
287
+ "rstrip": false,
288
+ "single_word": false,
289
+ "special": false
290
+ },
291
+ "50288": {
292
+ "content": "[unused3]",
293
+ "lstrip": false,
294
+ "normalized": true,
295
+ "rstrip": false,
296
+ "single_word": false,
297
+ "special": false
298
+ },
299
+ "50289": {
300
+ "content": "[unused4]",
301
+ "lstrip": false,
302
+ "normalized": true,
303
+ "rstrip": false,
304
+ "single_word": false,
305
+ "special": false
306
+ },
307
+ "50290": {
308
+ "content": "[unused5]",
309
+ "lstrip": false,
310
+ "normalized": true,
311
+ "rstrip": false,
312
+ "single_word": false,
313
+ "special": false
314
+ },
315
+ "50291": {
316
+ "content": "[unused6]",
317
+ "lstrip": false,
318
+ "normalized": true,
319
+ "rstrip": false,
320
+ "single_word": false,
321
+ "special": false
322
+ },
323
+ "50292": {
324
+ "content": "[unused7]",
325
+ "lstrip": false,
326
+ "normalized": true,
327
+ "rstrip": false,
328
+ "single_word": false,
329
+ "special": false
330
+ },
331
+ "50293": {
332
+ "content": "[unused8]",
333
+ "lstrip": false,
334
+ "normalized": true,
335
+ "rstrip": false,
336
+ "single_word": false,
337
+ "special": false
338
+ },
339
+ "50294": {
340
+ "content": "[unused9]",
341
+ "lstrip": false,
342
+ "normalized": true,
343
+ "rstrip": false,
344
+ "single_word": false,
345
+ "special": false
346
+ },
347
+ "50295": {
348
+ "content": "[unused10]",
349
+ "lstrip": false,
350
+ "normalized": true,
351
+ "rstrip": false,
352
+ "single_word": false,
353
+ "special": false
354
+ },
355
+ "50296": {
356
+ "content": "[unused11]",
357
+ "lstrip": false,
358
+ "normalized": true,
359
+ "rstrip": false,
360
+ "single_word": false,
361
+ "special": false
362
+ },
363
+ "50297": {
364
+ "content": "[unused12]",
365
+ "lstrip": false,
366
+ "normalized": true,
367
+ "rstrip": false,
368
+ "single_word": false,
369
+ "special": false
370
+ },
371
+ "50298": {
372
+ "content": "[unused13]",
373
+ "lstrip": false,
374
+ "normalized": true,
375
+ "rstrip": false,
376
+ "single_word": false,
377
+ "special": false
378
+ },
379
+ "50299": {
380
+ "content": "[unused14]",
381
+ "lstrip": false,
382
+ "normalized": true,
383
+ "rstrip": false,
384
+ "single_word": false,
385
+ "special": false
386
+ },
387
+ "50300": {
388
+ "content": "[unused15]",
389
+ "lstrip": false,
390
+ "normalized": true,
391
+ "rstrip": false,
392
+ "single_word": false,
393
+ "special": false
394
+ },
395
+ "50301": {
396
+ "content": "[unused16]",
397
+ "lstrip": false,
398
+ "normalized": true,
399
+ "rstrip": false,
400
+ "single_word": false,
401
+ "special": false
402
+ },
403
+ "50302": {
404
+ "content": "[unused17]",
405
+ "lstrip": false,
406
+ "normalized": true,
407
+ "rstrip": false,
408
+ "single_word": false,
409
+ "special": false
410
+ },
411
+ "50303": {
412
+ "content": "[unused18]",
413
+ "lstrip": false,
414
+ "normalized": true,
415
+ "rstrip": false,
416
+ "single_word": false,
417
+ "special": false
418
+ },
419
+ "50304": {
420
+ "content": "[unused19]",
421
+ "lstrip": false,
422
+ "normalized": true,
423
+ "rstrip": false,
424
+ "single_word": false,
425
+ "special": false
426
+ },
427
+ "50305": {
428
+ "content": "[unused20]",
429
+ "lstrip": false,
430
+ "normalized": true,
431
+ "rstrip": false,
432
+ "single_word": false,
433
+ "special": false
434
+ },
435
+ "50306": {
436
+ "content": "[unused21]",
437
+ "lstrip": false,
438
+ "normalized": true,
439
+ "rstrip": false,
440
+ "single_word": false,
441
+ "special": false
442
+ },
443
+ "50307": {
444
+ "content": "[unused22]",
445
+ "lstrip": false,
446
+ "normalized": true,
447
+ "rstrip": false,
448
+ "single_word": false,
449
+ "special": false
450
+ },
451
+ "50308": {
452
+ "content": "[unused23]",
453
+ "lstrip": false,
454
+ "normalized": true,
455
+ "rstrip": false,
456
+ "single_word": false,
457
+ "special": false
458
+ },
459
+ "50309": {
460
+ "content": "[unused24]",
461
+ "lstrip": false,
462
+ "normalized": true,
463
+ "rstrip": false,
464
+ "single_word": false,
465
+ "special": false
466
+ },
467
+ "50310": {
468
+ "content": "[unused25]",
469
+ "lstrip": false,
470
+ "normalized": true,
471
+ "rstrip": false,
472
+ "single_word": false,
473
+ "special": false
474
+ },
475
+ "50311": {
476
+ "content": "[unused26]",
477
+ "lstrip": false,
478
+ "normalized": true,
479
+ "rstrip": false,
480
+ "single_word": false,
481
+ "special": false
482
+ },
483
+ "50312": {
484
+ "content": "[unused27]",
485
+ "lstrip": false,
486
+ "normalized": true,
487
+ "rstrip": false,
488
+ "single_word": false,
489
+ "special": false
490
+ },
491
+ "50313": {
492
+ "content": "[unused28]",
493
+ "lstrip": false,
494
+ "normalized": true,
495
+ "rstrip": false,
496
+ "single_word": false,
497
+ "special": false
498
+ },
499
+ "50314": {
500
+ "content": "[unused29]",
501
+ "lstrip": false,
502
+ "normalized": true,
503
+ "rstrip": false,
504
+ "single_word": false,
505
+ "special": false
506
+ },
507
+ "50315": {
508
+ "content": "[unused30]",
509
+ "lstrip": false,
510
+ "normalized": true,
511
+ "rstrip": false,
512
+ "single_word": false,
513
+ "special": false
514
+ },
515
+ "50316": {
516
+ "content": "[unused31]",
517
+ "lstrip": false,
518
+ "normalized": true,
519
+ "rstrip": false,
520
+ "single_word": false,
521
+ "special": false
522
+ },
523
+ "50317": {
524
+ "content": "[unused32]",
525
+ "lstrip": false,
526
+ "normalized": true,
527
+ "rstrip": false,
528
+ "single_word": false,
529
+ "special": false
530
+ },
531
+ "50318": {
532
+ "content": "[unused33]",
533
+ "lstrip": false,
534
+ "normalized": true,
535
+ "rstrip": false,
536
+ "single_word": false,
537
+ "special": false
538
+ },
539
+ "50319": {
540
+ "content": "[unused34]",
541
+ "lstrip": false,
542
+ "normalized": true,
543
+ "rstrip": false,
544
+ "single_word": false,
545
+ "special": false
546
+ },
547
+ "50320": {
548
+ "content": "[unused35]",
549
+ "lstrip": false,
550
+ "normalized": true,
551
+ "rstrip": false,
552
+ "single_word": false,
553
+ "special": false
554
+ },
555
+ "50321": {
556
+ "content": "[unused36]",
557
+ "lstrip": false,
558
+ "normalized": true,
559
+ "rstrip": false,
560
+ "single_word": false,
561
+ "special": false
562
+ },
563
+ "50322": {
564
+ "content": "[unused37]",
565
+ "lstrip": false,
566
+ "normalized": true,
567
+ "rstrip": false,
568
+ "single_word": false,
569
+ "special": false
570
+ },
571
+ "50323": {
572
+ "content": "[unused38]",
573
+ "lstrip": false,
574
+ "normalized": true,
575
+ "rstrip": false,
576
+ "single_word": false,
577
+ "special": false
578
+ },
579
+ "50324": {
580
+ "content": "[unused39]",
581
+ "lstrip": false,
582
+ "normalized": true,
583
+ "rstrip": false,
584
+ "single_word": false,
585
+ "special": false
586
+ },
587
+ "50325": {
588
+ "content": "[unused40]",
589
+ "lstrip": false,
590
+ "normalized": true,
591
+ "rstrip": false,
592
+ "single_word": false,
593
+ "special": false
594
+ },
595
+ "50326": {
596
+ "content": "[unused41]",
597
+ "lstrip": false,
598
+ "normalized": true,
599
+ "rstrip": false,
600
+ "single_word": false,
601
+ "special": false
602
+ },
603
+ "50327": {
604
+ "content": "[unused42]",
605
+ "lstrip": false,
606
+ "normalized": true,
607
+ "rstrip": false,
608
+ "single_word": false,
609
+ "special": false
610
+ },
611
+ "50328": {
612
+ "content": "[unused43]",
613
+ "lstrip": false,
614
+ "normalized": true,
615
+ "rstrip": false,
616
+ "single_word": false,
617
+ "special": false
618
+ },
619
+ "50329": {
620
+ "content": "[unused44]",
621
+ "lstrip": false,
622
+ "normalized": true,
623
+ "rstrip": false,
624
+ "single_word": false,
625
+ "special": false
626
+ },
627
+ "50330": {
628
+ "content": "[unused45]",
629
+ "lstrip": false,
630
+ "normalized": true,
631
+ "rstrip": false,
632
+ "single_word": false,
633
+ "special": false
634
+ },
635
+ "50331": {
636
+ "content": "[unused46]",
637
+ "lstrip": false,
638
+ "normalized": true,
639
+ "rstrip": false,
640
+ "single_word": false,
641
+ "special": false
642
+ },
643
+ "50332": {
644
+ "content": "[unused47]",
645
+ "lstrip": false,
646
+ "normalized": true,
647
+ "rstrip": false,
648
+ "single_word": false,
649
+ "special": false
650
+ },
651
+ "50333": {
652
+ "content": "[unused48]",
653
+ "lstrip": false,
654
+ "normalized": true,
655
+ "rstrip": false,
656
+ "single_word": false,
657
+ "special": false
658
+ },
659
+ "50334": {
660
+ "content": "[unused49]",
661
+ "lstrip": false,
662
+ "normalized": true,
663
+ "rstrip": false,
664
+ "single_word": false,
665
+ "special": false
666
+ },
667
+ "50335": {
668
+ "content": "[unused50]",
669
+ "lstrip": false,
670
+ "normalized": true,
671
+ "rstrip": false,
672
+ "single_word": false,
673
+ "special": false
674
+ },
675
+ "50336": {
676
+ "content": "[unused51]",
677
+ "lstrip": false,
678
+ "normalized": true,
679
+ "rstrip": false,
680
+ "single_word": false,
681
+ "special": false
682
+ },
683
+ "50337": {
684
+ "content": "[unused52]",
685
+ "lstrip": false,
686
+ "normalized": true,
687
+ "rstrip": false,
688
+ "single_word": false,
689
+ "special": false
690
+ },
691
+ "50338": {
692
+ "content": "[unused53]",
693
+ "lstrip": false,
694
+ "normalized": true,
695
+ "rstrip": false,
696
+ "single_word": false,
697
+ "special": false
698
+ },
699
+ "50339": {
700
+ "content": "[unused54]",
701
+ "lstrip": false,
702
+ "normalized": true,
703
+ "rstrip": false,
704
+ "single_word": false,
705
+ "special": false
706
+ },
707
+ "50340": {
708
+ "content": "[unused55]",
709
+ "lstrip": false,
710
+ "normalized": true,
711
+ "rstrip": false,
712
+ "single_word": false,
713
+ "special": false
714
+ },
715
+ "50341": {
716
+ "content": "[unused56]",
717
+ "lstrip": false,
718
+ "normalized": true,
719
+ "rstrip": false,
720
+ "single_word": false,
721
+ "special": false
722
+ },
723
+ "50342": {
724
+ "content": "[unused57]",
725
+ "lstrip": false,
726
+ "normalized": true,
727
+ "rstrip": false,
728
+ "single_word": false,
729
+ "special": false
730
+ },
731
+ "50343": {
732
+ "content": "[unused58]",
733
+ "lstrip": false,
734
+ "normalized": true,
735
+ "rstrip": false,
736
+ "single_word": false,
737
+ "special": false
738
+ },
739
+ "50344": {
740
+ "content": "[unused59]",
741
+ "lstrip": false,
742
+ "normalized": true,
743
+ "rstrip": false,
744
+ "single_word": false,
745
+ "special": false
746
+ },
747
+ "50345": {
748
+ "content": "[unused60]",
749
+ "lstrip": false,
750
+ "normalized": true,
751
+ "rstrip": false,
752
+ "single_word": false,
753
+ "special": false
754
+ },
755
+ "50346": {
756
+ "content": "[unused61]",
757
+ "lstrip": false,
758
+ "normalized": true,
759
+ "rstrip": false,
760
+ "single_word": false,
761
+ "special": false
762
+ },
763
+ "50347": {
764
+ "content": "[unused62]",
765
+ "lstrip": false,
766
+ "normalized": true,
767
+ "rstrip": false,
768
+ "single_word": false,
769
+ "special": false
770
+ },
771
+ "50348": {
772
+ "content": "[unused63]",
773
+ "lstrip": false,
774
+ "normalized": true,
775
+ "rstrip": false,
776
+ "single_word": false,
777
+ "special": false
778
+ },
779
+ "50349": {
780
+ "content": "[unused64]",
781
+ "lstrip": false,
782
+ "normalized": true,
783
+ "rstrip": false,
784
+ "single_word": false,
785
+ "special": false
786
+ },
787
+ "50350": {
788
+ "content": "[unused65]",
789
+ "lstrip": false,
790
+ "normalized": true,
791
+ "rstrip": false,
792
+ "single_word": false,
793
+ "special": false
794
+ },
795
+ "50351": {
796
+ "content": "[unused66]",
797
+ "lstrip": false,
798
+ "normalized": true,
799
+ "rstrip": false,
800
+ "single_word": false,
801
+ "special": false
802
+ },
803
+ "50352": {
804
+ "content": "[unused67]",
805
+ "lstrip": false,
806
+ "normalized": true,
807
+ "rstrip": false,
808
+ "single_word": false,
809
+ "special": false
810
+ },
811
+ "50353": {
812
+ "content": "[unused68]",
813
+ "lstrip": false,
814
+ "normalized": true,
815
+ "rstrip": false,
816
+ "single_word": false,
817
+ "special": false
818
+ },
819
+ "50354": {
820
+ "content": "[unused69]",
821
+ "lstrip": false,
822
+ "normalized": true,
823
+ "rstrip": false,
824
+ "single_word": false,
825
+ "special": false
826
+ },
827
+ "50355": {
828
+ "content": "[unused70]",
829
+ "lstrip": false,
830
+ "normalized": true,
831
+ "rstrip": false,
832
+ "single_word": false,
833
+ "special": false
834
+ },
835
+ "50356": {
836
+ "content": "[unused71]",
837
+ "lstrip": false,
838
+ "normalized": true,
839
+ "rstrip": false,
840
+ "single_word": false,
841
+ "special": false
842
+ },
843
+ "50357": {
844
+ "content": "[unused72]",
845
+ "lstrip": false,
846
+ "normalized": true,
847
+ "rstrip": false,
848
+ "single_word": false,
849
+ "special": false
850
+ },
851
+ "50358": {
852
+ "content": "[unused73]",
853
+ "lstrip": false,
854
+ "normalized": true,
855
+ "rstrip": false,
856
+ "single_word": false,
857
+ "special": false
858
+ },
859
+ "50359": {
860
+ "content": "[unused74]",
861
+ "lstrip": false,
862
+ "normalized": true,
863
+ "rstrip": false,
864
+ "single_word": false,
865
+ "special": false
866
+ },
867
+ "50360": {
868
+ "content": "[unused75]",
869
+ "lstrip": false,
870
+ "normalized": true,
871
+ "rstrip": false,
872
+ "single_word": false,
873
+ "special": false
874
+ },
875
+ "50361": {
876
+ "content": "[unused76]",
877
+ "lstrip": false,
878
+ "normalized": true,
879
+ "rstrip": false,
880
+ "single_word": false,
881
+ "special": false
882
+ },
883
+ "50362": {
884
+ "content": "[unused77]",
885
+ "lstrip": false,
886
+ "normalized": true,
887
+ "rstrip": false,
888
+ "single_word": false,
889
+ "special": false
890
+ },
891
+ "50363": {
892
+ "content": "[unused78]",
893
+ "lstrip": false,
894
+ "normalized": true,
895
+ "rstrip": false,
896
+ "single_word": false,
897
+ "special": false
898
+ },
899
+ "50364": {
900
+ "content": "[unused79]",
901
+ "lstrip": false,
902
+ "normalized": true,
903
+ "rstrip": false,
904
+ "single_word": false,
905
+ "special": false
906
+ },
907
+ "50365": {
908
+ "content": "[unused80]",
909
+ "lstrip": false,
910
+ "normalized": true,
911
+ "rstrip": false,
912
+ "single_word": false,
913
+ "special": false
914
+ },
915
+ "50366": {
916
+ "content": "[unused81]",
917
+ "lstrip": false,
918
+ "normalized": true,
919
+ "rstrip": false,
920
+ "single_word": false,
921
+ "special": false
922
+ },
923
+ "50367": {
924
+ "content": "[unused82]",
925
+ "lstrip": false,
926
+ "normalized": true,
927
+ "rstrip": false,
928
+ "single_word": false,
929
+ "special": false
930
+ }
931
+ },
932
+ "clean_up_tokenization_spaces": true,
933
+ "cls_token": "[CLS]",
934
+ "extra_special_tokens": {},
935
+ "mask_token": "[MASK]",
936
+ "model_input_names": [
937
+ "input_ids",
938
+ "attention_mask"
939
+ ],
940
+ "model_max_length": 8192,
941
+ "pad_token": "[PAD]",
942
+ "sep_token": "[SEP]",
943
+ "tokenizer_class": "PreTrainedTokenizerFast",
944
+ "unk_token": "[UNK]"
945
+ }