Spaces: Running

Commit 64428bf · committed by Enzo Reis de Oliveira · Parent: a3a7416

Adding smi-ted things
Files changed:
- .gitattributes +6 -0
- .gitignore +18 -0
- README.md +185 -9
- app.py +54 -0
- config.json +26 -0
- install.sh +6 -0
- requirements.txt +12 -0
- smi-ted/README.md +142 -0
- smi-ted/inference/smi_ted_light/__init__.py +0 -0
- smi-ted/inference/smi_ted_light/__pycache__/load.cpython-310.pyc +0 -0
- smi-ted/inference/smi_ted_light/bert_vocab_curated.txt +2393 -0
- smi-ted/inference/smi_ted_light/load.py +642 -0
.gitattributes CHANGED

@@ -33,3 +33,9 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+smi-ted/finetune/moleculenet/qm9/qm9.csv filter=lfs diff=lfs merge=lfs -text
+smi-ted/finetune/moleculenet/qm9/train.csv filter=lfs diff=lfs merge=lfs -text
+smi-ted/images/smi-ted.png filter=lfs diff=lfs merge=lfs -text
+smi-ted/paper/smi_ted_preprint.pdf filter=lfs diff=lfs merge=lfs -text
+smi-ted.png filter=lfs diff=lfs merge=lfs -text
+*.png filter=lfs diff=lfs merge=lfs -text
.gitignore ADDED

@@ -0,0 +1,18 @@
# Model weights
inference/smi_ted_light/smi-ted-Light_40.pt

# pyenv
.python-version

# Environments
.env
./venv
env/
venv/
ENV/
env.bak/
venv.bak/

# editor files
.vscode/
.DS_Store
README.md CHANGED

@@ -1,12 +1,188 @@
---
license: apache-2.0
metrics:
- accuracy
pipeline_tag: feature-extraction
tags:
- chemistry
- foundation models
- AI4Science
- materials
- molecules
- safetensors
- pytorch
- transformer
- diffusers
library_name: transformers
---
# Introduction to IBM's Foundation Models for Materials

Welcome to IBM's series of large foundation models for sustainable materials. Our models span a variety of representations and modalities, including SMILES, SELFIES, 3D atom positions, 3D density grids, molecular graphs, and other formats. These models are designed to support and advance research in materials science and chemistry.

GitHub: [GitHub Link](https://github.com/IBM/materials/tree/main)

Paper: [arXiv:2407.20267](https://arxiv.org/abs/2407.20267)

# SMILES-based Transformer Encoder-Decoder (SMI-TED)

![smi-ted](smi-ted.png)

This repository provides PyTorch source code associated with our publication, "A Large Encoder-Decoder Family of Foundation Models for Chemical Language".

Paper: [Preprint PDF](https://github.com/IBM/materials/blob/main/smi-ted/paper/smi-ted_preprint.pdf)

We provide the model weights in two formats:

- PyTorch (`.pt`): [smi-ted-Light_40.pt](smi-ted-Light_40.pt)
- safetensors (`.safetensors`): [model_weights.safetensors](model_weights.safetensors)

For more information contact: [email protected] or [email protected].
## Introduction

We present a large encoder-decoder chemical foundation model, SMILES-based Transformer Encoder-Decoder (SMI-TED), pre-trained on a curated dataset of 91 million SMILES samples sourced from PubChem, equivalent to 4 billion molecular tokens. SMI-TED supports various complex tasks, including quantum property prediction, with two main variants (289M and 8x289M). Our experiments across multiple benchmark datasets demonstrate state-of-the-art performance for various tasks. For more information contact: [email protected] or [email protected].
## Table of Contents

1. [Getting Started](#getting-started)
    1. [Pretrained Models and Training Logs](#pretrained-models-and-training-logs)
    2. [Replicating Conda Environment](#replicating-conda-environment)
2. [Pretraining](#pretraining)
3. [Finetuning](#finetuning)
4. [Feature Extraction](#feature-extraction)
5. [Citations](#citations)
## Getting Started

**This code and environment have been tested on Nvidia V100s and Nvidia A100s**

### Pretrained Models and Training Logs

We provide checkpoints of the SMI-TED model pre-trained on a dataset of ~91M molecules curated from PubChem. The pre-trained model shows competitive performance on classification and regression benchmarks from MoleculeNet.

Add the SMI-TED pre-trained weights (`.pt`) to the `inference/` or `finetune/` directory, according to your needs. The directory structure should look like the following:

```
inference/
├── smi_ted_light
│   ├── smi_ted_light.pt
│   ├── bert_vocab_curated.txt
│   └── load.py
```

and/or:

```
finetune/
├── smi_ted_light
│   ├── smi_ted_light.pt
│   ├── bert_vocab_curated.txt
│   └── load.py
```

### Replicating Conda Environment

Follow these steps to replicate our Conda environment and install the necessary libraries:

#### Create and Activate Conda Environment

```
conda create --name smi-ted-env python=3.10
conda activate smi-ted-env
```

#### Install Packages with Conda

```
conda install pytorch=2.1.0 pytorch-cuda=11.8 -c pytorch -c nvidia
```

#### Install Packages with Pip

```
pip install -r requirements.txt
pip install pytorch-fast-transformers
```

## Pretraining

For pretraining, we use two strategies: the masked language model method to train the encoder part, and an encoder-decoder strategy to refine SMILES reconstruction and improve the generated latent space.
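To picture how the two objectives combine, here is a minimal, hypothetical PyTorch sketch; the layer count and masking rate are illustrative (only `n_embd=768`, `n_head=12`, and `max_len=202` mirror the shipped `config.json`), and the actual training scripts (`train_model_D.py`, `train_model_ED.py`) are the authoritative implementation:

```python
import torch
import torch.nn as nn

vocab_size, d_model, max_len = 2393, 768, 202  # sizes taken from this repo's files

embed = nn.Embedding(vocab_size, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True), num_layers=2)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead=12, batch_first=True), num_layers=2)
to_vocab = nn.Linear(d_model, vocab_size)
loss_fn = nn.CrossEntropyLoss()

tokens = torch.randint(4, vocab_size, (8, max_len))  # a fake batch of SMILES token ids
MASK_ID = 3                            # <mask>, assuming ids follow the vocab file order
mask = torch.rand(tokens.shape) < 0.15  # mask 15% of positions (rate is an assumption)
masked = tokens.masked_fill(mask, MASK_ID)

hidden = encoder(embed(masked))

# Strategy 1: masked-language-model loss on the encoder's predictions.
mlm_loss = loss_fn(to_vocab(hidden)[mask], tokens[mask])

# Strategy 2: reconstruct the full SMILES through the decoder from the latent space.
rec_logits = to_vocab(decoder(embed(tokens), hidden))
rec_loss = loss_fn(rec_logits.reshape(-1, vocab_size), tokens.reshape(-1))

(mlm_loss + rec_loss).backward()
```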
SMI-TED is pre-trained on canonicalized and curated 91M SMILES from PubChem with the following constraints:

- Compounds are filtered to a maximum length of 202 tokens during preprocessing.
- A 95/5/0 split is used for encoder training, with 5% of the data used for decoder pretraining.
- A 100/0/0 split is also used to train the encoder and decoder directly, enhancing model performance.

The pretraining code provides examples of data processing and model training on a smaller dataset, requiring 8 A100 GPUs.

To pre-train the two variants of the SMI-TED model, run:

```
bash training/run_model_light_training.sh
```

or

```
bash training/run_model_large_training.sh
```

Use `train_model_D.py` to train only the decoder, or `train_model_ED.py` to train both the encoder and decoder.

## Finetuning

The finetuning datasets and environment can be found in the [finetune](https://github.com/IBM/materials/tree/main/smi-ted/finetune) directory. After setting up the environment, you can run a finetuning task with:

```
bash finetune/smi_ted_light/esol/run_finetune_esol.sh
```

Finetuning training/checkpointing resources will be available in directories named `checkpoint_<measure_name>`.

## Feature Extraction

The example notebook [smi_ted_encoder_decoder_example.ipynb](https://github.com/IBM/materials/blob/main/smi-ted/notebooks/smi_ted_encoder_decoder_example.ipynb) contains code to load checkpoint files and use the pre-trained model for encoder and decoder tasks. It also includes examples of classification and regression tasks.

To load SMI-TED, you can simply use:

```python
model = load_smi_ted(
    folder='../inference/smi_ted_light',
    ckpt_filename='smi_ted_light.pt'
)
```

or, to load a raw state dict directly:

```python
with open('model_weights.bin', 'rb') as f:
    state_dict = torch.load(f)
model.load_state_dict(state_dict)
```
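Since the weights are also distributed as safetensors (listed above), a state dict can equally be loaded with the `safetensors` library; here `model` is assumed to already be constructed, e.g. via `load_smi_ted`:

```python
from safetensors.torch import load_file

# Read the safetensors checkpoint listed above and load it into the model.
state_dict = load_file('model_weights.safetensors')
model.load_state_dict(state_dict)
```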
To encode SMILES into embeddings, you can use:

```python
with torch.no_grad():
    encoded_embeddings = model.encode(df['SMILES'], return_torch=True)
```

To map embeddings back to SMILES strings, use the decoder:

```python
with torch.no_grad():
    decoded_smiles = model.decode(encoded_embeddings)
```
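Putting encode and decode together gives a quick sanity check; the molecules below are arbitrary examples, and `model` comes from `load_smi_ted` as above:

```python
import pandas as pd
import torch

df = pd.DataFrame({'SMILES': ['CCO', 'c1ccccc1', 'CC(=O)O']})  # ethanol, benzene, acetic acid

with torch.no_grad():
    encoded_embeddings = model.encode(df['SMILES'], return_torch=True)  # one 768-dim vector per molecule
    decoded_smiles = model.decode(encoded_embeddings)

print(decoded_smiles)  # should closely reproduce the input SMILES
```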
## Citations

```
@misc{soares2024largeencoderdecoderfamilyfoundation,
      title={A Large Encoder-Decoder Family of Foundation Models For Chemical Language},
      author={Eduardo Soares and Victor Shirasuna and Emilio Vital Brazil and Renato Cerqueira and Dmitry Zubarev and Kristin Schmidt},
      year={2024},
      eprint={2407.20267},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2407.20267},
}
```
app.py ADDED

@@ -0,0 +1,54 @@
import os, sys

BASE_DIR = os.path.dirname(__file__)
INFERENCE_DIR = os.path.join(BASE_DIR, "smi-ted", "inference")
sys.path.append(INFERENCE_DIR)

import gradio as gr
from smi_ted_light.load import load_smi_ted


# 2) Path that holds the weights and the vocabulary
MODEL_DIR = os.path.join("smi-ted", "inference", "smi_ted_light")

# 3) Load the SMI-TED (Light) model
#    If you renamed the .pt file or the vocab, adjust it here.
model = load_smi_ted(
    folder=MODEL_DIR,
    ckpt_filename="smi-ted-Light_40.pt",
    vocab_filename="bert_vocab_curated.txt",
)

# 4) Function used by the interface
def gerar_embedding(smiles: str):
    """
    Receives a SMILES string and returns the embedding (a list of 768 floats).
    On error, returns a dictionary with the error message.
    """
    smiles = smiles.strip()
    if not smiles:
        return {"erro": "enter a SMILES string first"}

    try:
        # model.encode returns a tensor of shape (1, 768) when return_torch=True
        vetor_torch = model.encode(smiles, return_torch=True)[0]
        return vetor_torch.tolist()  # JSON-serializable
    except Exception as e:
        return {"erro": str(e)}


# 5) Define the Gradio interface
demo = gr.Interface(
    fn=gerar_embedding,
    inputs=gr.Textbox(label="SMILES", placeholder="e.g. CCO"),
    outputs=gr.JSON(label="Embedding (list of floats)"),
    title="SMI-TED Embedding Generator",
    description=(
        "Paste a SMILES string and receive the embedding generated by the "
        "SMI-TED Light model trained by IBM Research."
    ),
)

# 6) Run locally or on a Hugging Face Space
if __name__ == "__main__":
    demo.launch()
config.json ADDED

@@ -0,0 +1,26 @@
{
    "n_batch": 32,
    "n_layer": 12,
    "n_head": 12,
    "n_embd": 768,
    "max_len": 202,
    "d_dropout": 0.1,
    "dropout": 0.1,
    "lr_start": 3e-5,
    "lr_multiplier": 1,
    "model_type": "SMI-TED",
    "max_epochs": 500,
    "num_feats": 32,
    "smi_ted_version": "v1",
    "model_path": "../",
    "ckpt_filename": "smi-ted-Light_40.pt",
    "data_root": "../../moleculenet/esol",
    "dataset_name": "esol",
    "measure_name": "measured log solubility in mols per litre",
    "checkpoints_folder": "./checkpoints_esol",
    "loss_fn": "rmse",
    "target_metric": "rmse",
    "save_ckpt": 1,
    "start_seed": 0,
    "train_decoder": 1
}
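This config mirrors the ESOL finetuning run (note `data_root`, `dataset_name`, and the MoleculeNet `measure_name`). One plausible way to consume it from Python, shown only as a sketch since the finetuning scripts may parse it differently:

```python
import json
from types import SimpleNamespace

with open("config.json") as f:
    cfg = SimpleNamespace(**json.load(f))

print(cfg.dataset_name, cfg.target_metric)   # esol rmse
print(cfg.n_layer, cfg.n_head, cfg.n_embd)   # 12 12 768
```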
install.sh ADDED

@@ -0,0 +1,6 @@
pip install torch==2.1.0

pip install pytorch-fast-transformers==0.4.0

pip install -r requirements.txt
requirements.txt ADDED

@@ -0,0 +1,12 @@
wheel
torch>=2.1.0
transformers>=4.40.0
pytorch-fast-transformers==0.4.0
regex
numpy==1.26.4
pandas==1.4.0
tqdm>=4.66.4
rdkit>=2024.3.5
gradio>=4.32.0
huggingface-hub
smi-ted/README.md ADDED

@@ -0,0 +1,142 @@
# SMILES-based Transformer Encoder-Decoder (SMI-TED)

This repository provides PyTorch source code associated with our publication, "A Large Encoder-Decoder Family of Foundation Models for Chemical Language".

Paper: [Preprint PDF](paper/smi_ted_preprint.pdf)

For model weights contact: [email protected] or [email protected].

![image](images/smi-ted.png)

## Introduction

We present a large encoder-decoder chemical foundation model, SMILES-based Transformer Encoder-Decoder (SMI-TED), pre-trained on a curated dataset of 91 million SMILES samples sourced from PubChem, equivalent to 4 billion molecular tokens. SMI-TED supports various complex tasks, including quantum property prediction, with two main variants ($289M$ and $8 \times 289M$). Our experiments across multiple benchmark datasets demonstrate state-of-the-art performance for various tasks. For model weights contact: [email protected] or [email protected].

## Table of Contents

1. [Getting Started](#getting-started)
    1. [Pretrained Models and Training Logs](#pretrained-models-and-training-logs)
    2. [Replicating Conda Environment](#replicating-conda-environment)
2. [Pretraining](#pretraining)
3. [Finetuning](#finetuning)
4. [Feature Extraction](#feature-extraction)
5. [Citations](#citations)

## Getting Started

**This code and environment have been tested on Nvidia V100s and Nvidia A100s**

### Pretrained Models and Training Logs

We provide checkpoints of the SMI-TED model pre-trained on a dataset of ~91M molecules curated from PubChem. The pre-trained model shows competitive performance on classification and regression benchmarks from MoleculeNet. For model weights contact: [email protected] or [email protected].

Add the SMI-TED pre-trained weights (`.pt`) to the `inference/` or `finetune/` directory, according to your needs. The directory structure should look like the following:

```
inference/
├── smi_ted_light
│   ├── smi_ted_light.pt
│   ├── bert_vocab_curated.txt
│   └── load.py
```

and/or:

```
finetune/
├── smi_ted_light
│   ├── smi_ted_light.pt
│   ├── bert_vocab_curated.txt
│   └── load.py
```

### Replicating Conda Environment

Follow these steps to replicate our Conda environment and install the necessary libraries:

#### Create and Activate Conda Environment

```
conda create --name smi-ted-env python=3.8.18
conda activate smi-ted-env
```

#### Install Packages with Conda

```
conda install pytorch=1.13.1 cudatoolkit=11.4 -c pytorch
conda install numpy=1.23.5 pandas=2.0.3
conda install rdkit=2021.03.5 -c conda-forge
```

#### Install Packages with Pip

```
pip install transformers==4.6.0 pytorch-fast-transformers==0.4.0 torch-optimizer==0.3.0 datasets==1.6.2 scikit-learn==1.3.2 scipy==1.12.0 tqdm==4.66.1
```

## Pretraining

For pretraining, we use two strategies: the masked language model method to train the encoder part, and an encoder-decoder strategy to refine SMILES reconstruction and improve the generated latent space.

SMI-TED is pre-trained on canonicalized and curated 91M SMILES from PubChem with the following constraints:

- Compounds are filtered to a maximum length of 202 tokens during preprocessing.
- A 95/5/0 split is used for encoder training, with 5% of the data used for decoder pretraining.
- A 100/0/0 split is also used to train the encoder and decoder directly, enhancing model performance.

The pretraining code provides examples of data processing and model training on a smaller dataset, requiring 8 A100 GPUs.

To pre-train the two variants of the SMI-TED model, run:

```
bash training/run_model_light_training.sh
```

or

```
bash training/run_model_large_training.sh
```

Use `train_model_D.py` to train only the decoder, or `train_model_ED.py` to train both the encoder and decoder.

## Finetuning

The finetuning datasets and environment can be found in the [finetune](finetune/) directory. After setting up the environment, you can run a finetuning task with:

```
bash finetune/smi_ted_light/esol/run_finetune_esol.sh
```

Finetuning training/checkpointing resources will be available in directories named `checkpoint_<measure_name>`.

## Feature Extraction

The example notebook [smi_ted_encoder_decoder_example.ipynb](notebooks/smi_ted_encoder_decoder_example.ipynb) contains code to load checkpoint files and use the pre-trained model for encoder and decoder tasks. It also includes examples of classification and regression tasks. For model weights contact: [email protected] or [email protected].

To load SMI-TED, you can simply use:

```python
model = load_smi_ted(
    folder='../inference/smi_ted_light',
    ckpt_filename='smi_ted_light.pt'
)
```

To encode SMILES into embeddings, you can use:

```python
with torch.no_grad():
    encoded_embeddings = model.encode(df['SMILES'], return_torch=True)
```

To map embeddings back to SMILES strings, use the decoder:

```python
with torch.no_grad():
    decoded_smiles = model.decode(encoded_embeddings)
```
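Beyond round trips, the embeddings serve directly as molecular features. A small illustrative example, assuming `model` was loaded as above, comparing molecules by cosine similarity:

```python
import pandas as pd
import torch
import torch.nn.functional as F

df = pd.DataFrame({'SMILES': ['CCO', 'CCCO', 'c1ccccc1']})  # ethanol, 1-propanol, benzene

with torch.no_grad():
    emb = model.encode(df['SMILES'], return_torch=True)  # shape (3, 768)

# Pairwise cosine similarity between embedding vectors.
sim = F.cosine_similarity(emb.unsqueeze(1), emb.unsqueeze(0), dim=-1)
print(sim)  # the two alcohols should score closer to each other than to benzene
```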
## Citations

```
@misc{soares2024largeencoderdecoderfamilyfoundation,
      title={A Large Encoder-Decoder Family of Foundation Models For Chemical Language},
      author={Eduardo Soares and Victor Shirasuna and Emilio Vital Brazil and Renato Cerqueira and Dmitry Zubarev and Kristin Schmidt},
      year={2024},
      eprint={2407.20267},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2407.20267},
}
```
smi-ted/inference/smi_ted_light/__init__.py ADDED

(empty file)
smi-ted/inference/smi_ted_light/__pycache__/load.cpython-310.pyc ADDED

Binary file (20.6 kB).
smi-ted/inference/smi_ted_light/bert_vocab_curated.txt ADDED

@@ -0,0 +1,2393 @@
<bos>
<eos>
<pad>
<mask>
C
c
(
)
1
O
N
2
=
n
3
[C@H]
[C@@H]
F
S
4
Cl
-
o
s
[nH]
#
/
Br
[C@]
[C@@]
[N+]
[O-]
5
\
.
I
6
[S@]
[S@@]
P
[N-]
[Si]
7
[n+]
[2H]
8
[NH+]
B
9
[C-]
[Na+]
[Cl-]
[c-]
[CH]
%10
[NH2+]
[P+]
[B]
[I-]
%11
[CH2-]
[O+]
[NH3+]
[C]
[Br-]
[IH2]
[S-]
[cH-]
%12
[nH+]
[B-]
[K+]
[Sn]
[Se]
[CH-]
[HH]
[Y]
[n-]
[CH3-]
[SiH]
[S+]
%13
[SiH2]
[Li+]
[NH-]
%14
[Na]
[CH2]
[O-2]
[U+2]
[W]
[Al]
[P@]
[Fe+2]
[PH+]
%15
[Cl+3]
[Zn+2]
[Ir]
[Mg+2]
[Pt+2]
[OH2+]
[As]
[Fe]
[OH+]
[Zr+2]
[3H]
[Ge]
[SiH3]
[OH-]
[NH4+]
[Cu+2]
[P@@]
p
[Pt]
%16
[Ca+2]
[Zr]
[F-]
[C+]
[Ti]
[P-]
[V]
[se]
[U]
[O]
[Ni+2]
[Zn]
[Co]
[Ni]
[Pd+2]
[Cu]
%17
[Cu+]
[Te]
[H+]
[CH+]
[Li]
[Pd]
[Mo]
[Ru+2]
[o+]
[Re]
[SH+]
%18
[Ac]
[Cr]
[NH2-]
[K]
[13CH2]
[c]
[Zr+4]
[Tl]
[13C]
[Mn]
[N@+]
[Hg]
[Rh]
[Ti+4]
[Sb]
[Co+2]
[Ag+]
[Ru]
%19
[N@@+]
[Ti+2]
[Al+3]
[Pb]
[I+]
[18F]
[s+]
[Rb+]
[Ba+2]
[H-]
[Fe+3]
[Ir+3]
[13cH]
%20
[AlH2]
[Au+]
[13c]
[SH2+]
[Sn+2]
[Mn+2]
[Si-]
[Ag]
[N]
[Bi]
%21
[In]
[CH2+]
[Y+3]
[Ga]
%22
[Co+3]
[Au]
[13CH3]
[Mg]
[Cs+]
[W+2]
[Hf]
[Zn+]
[Se-]
[S-2]
[Ca]
[pH]
[ClH+]
[Ti+3]
%23
[Ru+]
[SH-]
[13CH]
[IH+]
[Hf+4]
[Rf]
[OH3+]
%24
[Pt+4]
[Zr+3]
[PH3+]
[Sr+2]
[Cd+2]
[Cd]
%25
[Os]
[BH-]
[Sn+4]
[Cr+3]
[Ru+3]
[PH2+]
[Rh+2]
[V+2]
%26
[Gd+3]
[Pb+2]
[PH]
[Hg+]
[Mo+2]
[AlH]
[Sn+]
%27
[Pd+]
b
[Rh+3]
[Hg+2]
[15NH]
[14C]
%28
[Mn+3]
[Si+]
[SeH]
[13C@H]
[NH]
[Ga+3]
[SiH-]
[13C@@H]
[Ce]
[Au+3]
[Bi+3]
[15N]
%29
[BH3-]
[14cH]
[Ti+]
[Gd]
[cH+]
[Cr+2]
[Sb-]
%30
[Be+2]
[Al+]
[te]
[11CH3]
[Sm]
[Pr]
[La]
%31
[Al-]
[Ta]
[125I]
[BH2-]
[Nb]
[Si@]
%32
[14c]
[Sb+3]
[Ba]
%33
[Os+2]
[Si@@]
[La+3]
[15n]
[15NH2]
[Nd+3]
%34
[14CH2]
[18O]
[Nd]
[GeH]
[Ni+3]
[Eu]
[Dy+3]
[Sc]
%36
[Se-2]
[As+]
%35
[AsH]
[Tb]
[Sb+5]
[Se+]
[Ce+3]
[c+]
[In+3]
[SnH]
[Mo+4]
%37
[V+4]
[Eu+3]
[Hf+2]
%38
[Pt+]
[p+]
[123I]
[Tl+]
[Sm+3]
%39
[Yb+3]
%40
[Yb]
[Os+]
%41
[10B]
[Sc+3]
[Al+2]
%42
[Sr]
[Tb+3]
[Po]
[Tc]
[PH-]
[AlH3]
[Ar]
[U+4]
[SnH2]
[Cl+2]
[si]
[Fe+]
[14CH3]
[U+3]
[Cl+]
%43
[GeH2]
%44
[Er+3]
[Mo+3]
[I+2]
[Fe+4]
[99Tc]
%45
[11C]
%46
[SnH3]
[S]
[Te+]
[Er]
[Lu+3]
[11B]
%47
%48
[P]
[Tm]
[Th]
[Dy]
[Pr+3]
[Ta+5]
[Nb+5]
[Rb]
[GeH3]
[Br+2]
%49
[131I]
[Fm]
[Cs]
[BH4-]
[Lu]
[15nH]
%50
[Ru+6]
[b-]
[Ho]
[Th+4]
[Ru+4]
%52
[14CH]
%51
[Cr+6]
[18OH]
[Ho+3]
[Ce+4]
[Bi+2]
[Co+]
%53
[Yb+2]
[Fe+6]
[Be]
%54
[SH3+]
[Np]
[As-]
%55
[14C@@H]
[Ir+2]
[GaH3]
[p-]
[GeH4]
[Sn+3]
[Os+4]
%56
[14C@H]
[sH+]
[19F]
[Eu+2]
[TlH]
%57
[Cr+4]
%58
[B@@-]
[SiH+]
[At]
[Am]
[Fe+5]
[AsH2]
[Si+4]
[B@-]
[Pu]
[SbH]
[P-2]
[Tm+3]
*
%59
[se+]
[IH-]
%60
[oH+]
[1H]
[15N+]
[124I]
[S@@+]
[P-3]
[H]
[IH2+]
[TeH]
[Xe]
[PH4+]
[Cr+]
[Cm]
[I+3]
%61
[Nb+2]
[Ru+5]
%62
[Ta+2]
[Tc+4]
[CH3+]
[Pm]
[Si@H]
[No]
%63
[Cr+5]
[Th+2]
[Zn-2]
[13C@]
[Lr]
%64
[99Tc+3]
%65
[13C@@]
%66
[Fe-]
[17O]
[siH]
[Sb+]
[OH]
[IH]
[11CH2]
[Cf]
[SiH2+]
[Gd+2]
[In+]
[Si@@H]
[Mn+]
[99Tc+4]
[Ga-]
%67
[S@+]
[Ge+4]
[Tl+3]
[16OH]
%68
[2H-]
[Ra]
[si-]
[NiH2]
[P@@H]
[Rh+]
[12C]
[35S]
[32P]
[SiH2-]
[AlH2+]
[16O]
%69
[BiH]
[BiH2]
[Zn-]
[BH]
[Tc+3]
[Ir+]
[Ni+]
%70
[InH2]
[InH]
[Nb+3]
[PbH]
[Bi+]
%71
[As+3]
%72
[18O-]
[68Ga+3]
%73
[Pa]
[76Br]
[Tc+5]
[pH+]
[64Cu+2]
[Ru+8]
%74
[PH2-]
[Si+2]
[17OH]
[RuH]
[111In+3]
[AlH+]
%75
%76
[W+]
[SbH2]
[PoH]
[Ru-]
[XeH]
[Tc+2]
[13C-]
[Br+]
[Pt-2]
[Es]
[Cu-]
[Mg+]
[3HH]
[P@H]
[ClH2+]
%77
[SH]
[Au-]
[2HH]
%78
[Sn-]
[11CH]
[PdH2]
0
[Os+6]
%79
[Mo+]
%80
[al]
[PbH2]
[64Cu]
[Cl]
[12CH3]
%81
[Tc+7]
[11c]
%82
[Li-]
[99Tc+5]
[He]
[12c]
[Kr]
[RuH+2]
[35Cl]
[Pd-2]
[GaH2]
[4H]
[Sg]
[Cu-2]
[Br+3]
%83
[37Cl]
[211At]
[IrH+2]
[Mt]
[Ir-2]
[In-]
[12cH]
[12CH2]
[RuH2]
[99Tc+7]
%84
[15n+]
[ClH2+2]
[16N]
[111In]
[Tc+]
[Ru-2]
[12CH]
[si+]
[Tc+6]
%85
%86
[90Y]
[Pd-]
[188Re]
[RuH+]
[NiH]
[SiH3-]
[14n]
[CH3]
[14N]
[10BH2]
%88
%89
%90
[34S]
[77Br]
[GaH]
[Br]
[Ge@]
[B@@H-]
[CuH]
[SiH4]
[3H-]
%87
%91
%92
[67Cu]
[I]
[177Lu]
[ReH]
[67Ga+3]
[Db]
[177Lu+3]
[AlH2-]
[Si+3]
[Ti-2]
[RuH+3]
[al+]
[68Ga]
[2H+]
[B@H-]
[WH2]
[OsH]
[Ir-3]
[AlH-]
[Bk]
[75Se]
[14C@]
[Pt-]
[N@@H+]
[Nb-]
[13NH2]
%93
[186Re]
[Tb+4]
[PtH]
[IrH2]
[Hg-2]
[AlH3-]
[PdH+]
[Md]
[RhH+2]
[11cH]
[Co-2]
[15N-]
[ZrH2]
%94
[Hg-]
[127I]
[AsH2+]
[MoH2]
[Te+4]
[14C@@]
[As+5]
[SnH+3]
[Ge@@]
[6Li+]
[WH]
[Ne]
[14NH2]
[14NH]
[12C@@H]
[Os+7]
[RhH]
[Al-3]
[SnH+]
[15NH3+]
[Zr+]
[197Hg+]
%95
%96
[90Y+3]
[Os-2]
[98Tc+5]
[15NH3]
[bH-]
[33P]
[Zr-2]
[15O]
[Rh-]
[PbH3]
[PH2]
[Ni-]
[CuH+]
%97
%98
%99
[Os+5]
[PtH+]
[ReH4]
[16NH]
[82Br]
[W-]
[18F-]
[15NH4+]
[Se+4]
[SeH-]
[SH4]
[67Cu+2]
[12C@H]
[AsH3]
[HgH]
[10B-]
[99Tc+6]
[117Sn+4]
[Te@]
[P@+]
[35SH]
[SeH+]
[Ni-2]
[Al-2]
[TeH2]
[Bh]
[99Tc+2]
[Os+8]
[PH-2]
[7Li+]
[14nH]
[AlH+2]
[18FH]
[SnH4]
[18O-2]
[IrH]
[13N]
[Te@@]
[Rh-3]
[15NH+]
[AsH3+]
[SeH2]
[AsH+]
[CoH2]
[16NH2]
[AsH-]
[203Hg+]
[P@@+]
[166Ho+3]
[60Co+3]
[13CH2-]
[SeH2+]
[75Br]
[TlH2]
[80Br]
[siH+]
[Ca+]
[153Sm+3]
[PdH]
[225Ac]
[13CH3-]
[AlH4-]
[FeH]
[13CH-]
[14C-]
[11C-]
[153Sm]
[Re-]
[te+]
[13CH4]
[ClH+2]
[8CH2]
[99Mo]
[ClH3+3]
[SbH3]
[25Mg+2]
[16N+]
[SnH2+]
[PH4]
[11C@H]
[122I]
[Re-2]
[RuH2+2]
[ZrH]
[Bi-]
[Pr+]
[Rn]
[Fr]
[36Cl]
[18o]
[YH]
[79Br]
[121I]
[113In+3]
[InH4-]
[TaH]
[RhH2]
[Ta-]
[67Ga]
[ZnH+]
[SnH2-]
[OsH2]
[16F]
[FeH2]
[14O]
[PbH2+2]
[BH2]
[6H]
[125Te]
[197Hg]
[TaH2]
[TaH3]
[76As]
[Nb-2]
[14N+]
[125I-]
[33S]
[IH2+2]
[NH2]
[PtH2]
[MnH]
[19C]
[17F]
[1H-]
[SnH4+2]
[Mn-2]
[15NH2+]
[TiH2]
[ReH7]
[Cd-2]
[Fe-3]
[SH2]
[17O-]
[siH-]
[CoH+]
[VH]
[10BH]
[Ru-3]
[13O]
[5H]
[CoH]
[PH5]
[15n-]
[153Gd]
[12C@]
[11CH3-]
[IrH3]
[RuH3]
[74Se]
[Se@]
[Hf+]
[77Se]
[166Ho]
[59Fe+2]
[203Hg]
[18OH-]
[8CH]
[12C@@]
[11CH4]
[15C]
[249Cf]
[PbH4]
[64Zn]
[PH3]
[99Tc+]
[14c-]
[149Pm]
[IrH4]
[Se@@]
[13OH]
[14CH3-]
[28Si]
[Rh-2]
[Fe-2]
[131I-]
[51Cr]
[62Cu+2]
[81Br]
[121Sb]
[7Li]
[89Zr+4]
[SbH3+]
[11C@@H]
[98Tc]
[59Fe+3]
[BiH2+]
[SbH+]
[TiH]
[14NH3]
[15OH]
[119Sn]
[201Hg]
[MnH+]
[201Tl]
[51Cr+3]
[123I-]
[MoH]
[AlH6-3]
[MnH2]
[WH3]
[213Bi+3]
[SnH2+2]
[123IH]
[13CH+]
[Zr-]
[74As]
[13C+]
[32P+]
[KrH]
[SiH+2]
[ClH3+2]
[13NH]
[9CH2]
[ZrH2+2]
[87Sr+2]
[35s]
[239Pu]
[198Au]
[241Am]
[203Hg+2]
[V+]
[YH2]
[SH5]
[195Pt]
[203Pb]
[RuH4]
[ThH2]
[AuH]
[66Ga+3]
[11B-]
[F]
[24Na+]
[85Sr+2]
[201Tl+]
[14CH4]
[32S]
[TeH2+]
[ClH2+3]
[AgH]
[Ge@H]
[44Ca+2]
[Os-]
[31P]
[15nH+]
[SbH4]
[TiH+]
[Ba+]
[57Co+2]
[Ta+]
[125IH]
[77As]
[129I]
[Fe-4]
[Ta-2]
[19O]
[12O]
[BiH3]
[237Np]
[252Cf]
[86Y]
[Cr-2]
[89Y]
[195Pt+2]
[si+2]
[58Fe+2]
[Hs]
[S@@H]
[OsH6]
[GdH2]
[IH3]
[8CH4]
[164Dy+3]
[47Ca+2]
[57Co]
[NbH2]
[ReH2]
[ZnH2]
[CrH2]
[17NH]
[ZrH3]
[RhH3]
[12C-]
[18O+]
[Bi-2]
[ClH4+3]
[Ni-3]
[Ag-]
[111In-]
[Mo-2]
[55Fe+3]
[204Hg+]
[35Cl-]
[211Pb]
[75Ge]
[8B]
[TeH3]
[SnH3+]
[Zr-3]
[28F]
[249Bk]
[169Yb]
[34SH]
[6Li]
[94Tc]
[197Au]
[195Pt+4]
[169Yb+3]
[32Cl]
[82Se]
[159Gd+3]
[213Bi]
[CoH+2]
[36S]
[35P]
[Ru-4]
[Cr-3]
[60Co]
[1H+]
[18CH2]
[Cd-]
[152Sm+3]
[106Ru]
[238Pu]
[220Rn]
[45Ca+2]
[89Sr+2]
[239Np]
[90Sr+2]
[137Cs+]
[165Dy]
[68GaH3]
[65Zn+2]
[89Zr]
[BiH2+2]
[62Cu]
[165Dy+3]
[238U]
[105Rh+3]
[70Zn]
[12B]
[12OH]
[18CH]
[17CH]
[OsH3]
[SbH-]
[SH6]
[AlH2-2]
[42K]
[76Br-]
[71As]
[NbH3]
[ReH3]
[OsH-]
[WH4]
[MoH3]
[OsH4]
[RuH6]
[PtH3]
[CuH2]
[CoH3]
[TiH4]
[64Zn+2]
[Si-2]
[79BrH]
[14CH2-]
[PtH2+2]
[Os-3]
[29Si]
[Ti-]
[Se+6]
[22Na+]
[42K+]
[131Cs+]
[86Rb+]
[134Cs+]
[209Po]
[208Po]
[81Rb+]
[203Tl+]
[Zr-4]
[148Sm]
[147Sm]
[37Cl-]
[12CH4]
[Ge@@H]
[63Cu]
[13CH2+]
[AsH2-]
[CeH]
[SnH-]
[UH]
[9c]
[21CH3]
[TeH+]
[57Co+3]
[8BH2]
[12BH2]
[19BH2]
[9BH2]
[YbH2]
[CrH+2]
[208Bi]
[152Gd]
[61Cu]
[115In]
[60Co+2]
[13NH2-]
[120I]
[18OH2]
[75SeH]
[SbH2+]
[144Ce]
[16n]
[113In]
[22nH]
[129I-]
[InH3]
[32PH3]
[234U]
[235U]
[59Fe]
[82Rb+]
[65Zn]
[244Cm]
[147Pm]
[91Y]
[237Pu]
[231Pa]
[253Cf]
[127Te]
[187Re]
[236Np]
[235Np]
[72Zn]
[253Es]
[159Dy]
[62Zn]
[101Tc]
[149Tb]
[124I-]
[SeH3+]
[210Pb]
[40K]
[210Po]
[214Pb]
[218Po]
[214Po]
[7Be]
[212Pb]
[205Pb]
[209Pb]
[123Te]
[202Pb]
[72As]
[201Pb]
[70As]
[73Ge]
[200Pb]
[198Pb]
[66Ga]
[73Se]
[195Pb]
[199Pb]
[144Ce+3]
[235U+2]
[90Tc]
[114In+3]
[128I]
[100Tc+]
[82Br-]
[191Pt+2]
[191Pt+4]
[193Pt+4]
[31PH3]
[125I+2]
[131I+2]
[125Te+4]
[82Sr+2]
[149Sm]
[81BrH]
[129Xe]
[193Pt+2]
[123I+2]
[Cr-]
[Co-]
[227Th+4]
[249Cf+3]
[252Cf+3]
[187Os]
[16O-]
[17O+]
[16OH-]
[98Tc+7]
[58Co+2]
[69Ga+3]
[57Fe+2]
[43K+]
[16C]
[52Fe+3]
[SeH5]
[194Pb]
[196Pb]
[197Pb]
[213Pb]
[9B]
[19B]
[11CH-]
[9CH]
[20OH]
[25OH]
[8cH]
[TiH+3]
[SnH6+3]
[N@H+]
[ZnH]
[VH3]
[52Mn+2]
[64Ga]
[13B]
[216Bi]
[117Sn+2]
[232Th]
[SnH+2]
[BiH5]
[77Kr]
[103Cd]
[62Ni]
[LaH3]
[SmH3]
[EuH3]
[MoH5]
[64Ni]
[66Zn]
[68Zn]
[186W]
[FeH4]
[MoH4]
[HgH2]
[15NH2-]
[UH2]
[204Hg]
[GaH4-]
[ThH4]
[WH6]
[PtH4]
[VH2]
[UH3]
[FeH3]
[RuH5]
[BiH4]
[80Br-]
[CeH3]
[37ClH]
[157Gd+3]
[205Tl]
[203Tl]
[62Cu+]
[64Cu+]
[61Cu+]
[37SH2]
[30Si]
[28Al]
[19OH2]
[8He]
[6He]
[153Pm]
[209Bi]
[66Zn+2]
[10CH4]
[191Ir]
[66Cu]
[16O+]
[25O]
[10c]
[Co-3]
[Sn@@]
[17OH-]
[206Po]
[204Po]
[202Po]
[201Po]
[200Po]
[199Po]
[198Po]
[197Po]
[196Po]
[195Po]
[194Po]
[193Po]
[192Po]
[191Po]
[190Po]
[217Po]
[BiH4-]
[TeH4]
[222Ra]
[62Ga]
[39Ar]
[144Sm]
[58Fe]
[153Eu]
[85Rb]
[171Yb]
[172Yb]
[114Cd]
[51Fe]
[142Ce]
[207Tl]
[92Mo]
[115Sn]
[140Ce]
[202Hg]
[180W]
[182W]
[183W]
[184W]
[96Mo]
[47Ti]
[111Cd]
[143Nd]
[145Nd]
[126Te]
[128Te]
[130Te]
[185Re]
[97Mo]
[98Mo]
[183Re]
[52V]
[80Se]
[87Kr]
[137Xe]
[196Au]
[146Ce]
[88Kr]
[51Ti]
[138Xe]
[112Cd]
[116Sn]
[120Sn]
[28SiH3]
[35S-]
[15NH-]
[13CH3+]
[34S+]
[34s]
[SiH4-]
[100Tc+5]
[NiH2+2]
[239Th]
[186Lu]
[AuH3]
[I@@-]
[XeH2]
[B+]
[16CH2]
[8C]
[TaH5]
[FeH4-]
[19C@H]
[10NH]
[FeH6-3]
[22CH]
[25N]
[25N+]
[25N-]
[21CH2]
[18cH]
[113I]
[ScH3]
[30PH3]
[43Ca+2]
[41Ca+2]
[106Cd]
[122Sn]
[18CH3]
[58Co+3]
[98Tc+4]
[70Ge]
[76Ge]
[108Cd]
[116Cd]
[130Xe]
[94Mo]
[124Sn]
[186Os]
[188Os]
[190Os]
[192Os]
[106Pd]
[110Pd]
[120Te]
[132Ba]
[134Ba]
[136Ba]
[136Ce]
[138Ce]
[156Dy]
[158Dy]
[160Dy]
[163Dy]
[162Er]
[164Er]
[167Er]
[176Hf]
[26Mg]
[144Nd]
[150Nd]
[41K]
[46Ti]
[48Ti]
[49Ti]
[50Ti]
[170Yb]
[173Yb]
[91Zr]
[92Zr]
[96Zr]
[34S-]
[CuH2-]
[38Cl]
[25Mg]
[51V]
[93Nb]
[95Mo]
[45Sc]
[123Sb]
[139La]
[9Be]
[99Y+3]
[99Y]
[156Ho]
[67Zn]
[144Ce+4]
[210Tl]
[42Ca]
[54Fe]
[193Ir]
[92Nb]
[141Cs]
[52Cr]
[35ClH]
[46Ca]
[139Cs]
[65Cu]
[71Ga]
[60Ni]
[16NH3]
[148Nd]
[72Ge]
[161Dy]
[49Ca]
[43Ca]
[8Be]
[48Ca]
[44Ca]
[120Xe]
[80Rb]
[215At]
[180Re]
[146Sm]
[19Ne]
[74Kr]
[134La]
[76Kr]
[219Fr]
[121Xe]
[220Fr]
[216At]
[223Ac]
[218At]
[37Ar]
[135I]
[110Cd]
[94Tc+7]
[86Y+3]
[135I-]
[15O-2]
[151Eu+3]
[161Tb+3]
[197Hg+2]
[109Cd+2]
[191Os+4]
[170Tm+3]
[205Bi+3]
[233U+4]
[126Sb+3]
[127Sb+3]
[132Cs+]
[136Eu+3]
[136Eu]
[125Sn+4]
[175Yb+3]
[100Mo]
[22Ne]
[13c-]
[13NH4+]
[17C]
[9C]
[31S]
[31SH]
[133I]
[126I]
[36SH]
[30S]
[32SH]
[19CH2]
[19c]
[18c]
[15F]
[10C]
[RuH-]
[62Zn+2]
[32ClH]
[33ClH]
[78BrH]
[12Li+]
[12Li]
[233Ra]
[68Ge+4]
[44Sc+3]
[91Y+3]
[106Ru+3]
[PoH2]
[AtH]
[55Fe]
[233U]
[210PoH2]
[230Th]
[228Th]
[222Rn]
[35SH2]
[227Th]
[192Ir]
[133Xe]
[81Kr]
[95Zr]
[240Pu]
[54Mn]
[103Ru]
[95Nb]
[109Cd]
[141Ce]
[85Kr]
[110Ag]
[58Co]
[241Pu]
[234Th]
[140La]
[63Ni]
[152Eu]
[132IH]
[226Rn]
[154Eu]
[36ClH]
[228Ac]
[155Eu]
[106Rh]
[243Am]
[227Ac]
[243Cm]
[236U]
[144Pr]
[232U]
[32SH2]
[88Y]
[82BrH]
[135IH]
[242Cm]
[115Cd]
[242Pu]
[46Sc]
[56Mn]
[234Pa]
[41Ar]
[147Nd]
[187W]
[151Sm]
[59Ni]
[233Pa]
[52Mn]
[94Nb]
[219Rn]
[236Pu]
[13NH3]
[93Zr]
[51Cr+6]
[TlH3]
[123Xe]
[160Tb]
[170Tm]
[182Ta]
[175Yb]
[93Mo]
[143Ce]
[191Os]
[126IH]
[48V]
[113Cd]
[47Sc]
[181Hf]
[185W]
[143Pr]
[191Pt]
[181W]
[33PH3]
[97Ru]
[97Tc]
[111Ag]
[169Er]
[107Pd]
[103Ru+2]
[34SH2]
[137Ce]
[242Am]
[117SnH2]
[57Ni]
[239U]
[60Cu]
[250Cf]
[193Au]
[69Zn]
[55Co]
[139Ce]
[127Xe]
[159Gd]
[56Co]
[177Hf]
[244Pu]
[38ClH]
[142Pr]
[199Hg]
[179Hf]
[178Hf]
[237U]
[156Eu]
[157Eu]
[105Ru]
[171Tm]
[199Au]
[155Sm]
[80BrH]
[108Ag]
[128IH]
[48Sc]
[45Ti]
[176Lu]
[121SnH2]
[148Pm]
[57Fe]
[10BH3]
[96Tc]
[133IH]
[143Pm]
[105Rh]
[130IH]
[134IH]
[131IH]
[71Zn]
[105Ag]
[97Zr]
[235Pu]
[231Th]
[109Pd]
[93Y]
[190Ir]
[135Xe]
[53Mn]
[134Ce]
[234Np]
[240Am]
[246Cf]
[240Cm]
[241Cm]
[226Th]
[39ClH]
[229Th]
[245Cm]
[240U]
[240Np]
[249Cm]
[243Pu]
[145Pm]
[199Pt]
[246Bk]
[193Pt]
[230U]
[250Cm]
|
1715 |
+
[44Ti]
|
1716 |
+
[175Hf]
|
1717 |
+
[254Fm]
|
1718 |
+
[255Fm]
|
1719 |
+
[257Fm]
|
1720 |
+
[92Y]
|
1721 |
+
[188Ir]
|
1722 |
+
[171Lu]
|
1723 |
+
[257Md]
|
1724 |
+
[247Bk]
|
1725 |
+
[121IH]
|
1726 |
+
[250Bk]
|
1727 |
+
[179Lu]
|
1728 |
+
[224Ac]
|
1729 |
+
[195Hg]
|
1730 |
+
[244Am]
|
1731 |
+
[246Pu]
|
1732 |
+
[194Au]
|
1733 |
+
[252Fm]
|
1734 |
+
[173Hf]
|
1735 |
+
[246Cm]
|
1736 |
+
[135Ce]
|
1737 |
+
[49Cr]
|
1738 |
+
[248Cf]
|
1739 |
+
[247Cm]
|
1740 |
+
[248Cm]
|
1741 |
+
[174Ta]
|
1742 |
+
[176Ta]
|
1743 |
+
[154Tb]
|
1744 |
+
[172Ta]
|
1745 |
+
[177Ta]
|
1746 |
+
[175Ta]
|
1747 |
+
[180Ta]
|
1748 |
+
[158Tb]
|
1749 |
+
[115Ag]
|
1750 |
+
[189Os]
|
1751 |
+
[251Cf]
|
1752 |
+
[145Pr]
|
1753 |
+
[147Pr]
|
1754 |
+
[76BrH]
|
1755 |
+
[102Rh]
|
1756 |
+
[238Np]
|
1757 |
+
[185Os]
|
1758 |
+
[246Am]
|
1759 |
+
[233Np]
|
1760 |
+
[166Dy]
|
1761 |
+
[254Es]
|
1762 |
+
[244Cf]
|
1763 |
+
[193Os]
|
1764 |
+
[245Am]
|
1765 |
+
[245Bk]
|
1766 |
+
[239Am]
|
1767 |
+
[238Am]
|
1768 |
+
[97Nb]
|
1769 |
+
[245Pu]
|
1770 |
+
[254Cf]
|
1771 |
+
[188W]
|
1772 |
+
[250Es]
|
1773 |
+
[251Es]
|
1774 |
+
[237Am]
|
1775 |
+
[182Hf]
|
1776 |
+
[258Md]
|
1777 |
+
[232Np]
|
1778 |
+
[238Cm]
|
1779 |
+
[60Fe]
|
1780 |
+
[109Pd+2]
|
1781 |
+
[234Pu]
|
1782 |
+
[141Ce+3]
|
1783 |
+
[136Nd]
|
1784 |
+
[136Pr]
|
1785 |
+
[173Ta]
|
1786 |
+
[110Ru]
|
1787 |
+
[147Tb]
|
1788 |
+
[253Fm]
|
1789 |
+
[139Nd]
|
1790 |
+
[178Re]
|
1791 |
+
[177Re]
|
1792 |
+
[200Au]
|
1793 |
+
[182Re]
|
1794 |
+
[156Tb]
|
1795 |
+
[155Tb]
|
1796 |
+
[157Tb]
|
1797 |
+
[161Tb]
|
1798 |
+
[161Ho]
|
1799 |
+
[167Tm]
|
1800 |
+
[173Lu]
|
1801 |
+
[179Ta]
|
1802 |
+
[171Er]
|
1803 |
+
[44Sc]
|
1804 |
+
[49Sc]
|
1805 |
+
[49V]
|
1806 |
+
[51Mn]
|
1807 |
+
[90Nb]
|
1808 |
+
[88Nb]
|
1809 |
+
[88Zr]
|
1810 |
+
[36SH2]
|
1811 |
+
[174Yb]
|
1812 |
+
[178Lu]
|
1813 |
+
[179W]
|
1814 |
+
[83BrH]
|
1815 |
+
[107Cd]
|
1816 |
+
[75BrH]
|
1817 |
+
[62Co]
|
1818 |
+
[48Cr]
|
1819 |
+
[63Zn]
|
1820 |
+
[102Ag]
|
1821 |
+
[154Sm]
|
1822 |
+
[168Er]
|
1823 |
+
[65Ni]
|
1824 |
+
[137La]
|
1825 |
+
[187Ir]
|
1826 |
+
[144Pm]
|
1827 |
+
[146Pm]
|
1828 |
+
[160Gd]
|
1829 |
+
[166Yb]
|
1830 |
+
[162Dy]
|
1831 |
+
[47V]
|
1832 |
+
[141Nd]
|
1833 |
+
[141Sm]
|
1834 |
+
[166Er]
|
1835 |
+
[150Sm]
|
1836 |
+
[146Eu]
|
1837 |
+
[149Eu]
|
1838 |
+
[174Lu]
|
1839 |
+
[17NH3]
|
1840 |
+
[102Ru]
|
1841 |
+
[170Hf]
|
1842 |
+
[188Pt]
|
1843 |
+
[61Ni]
|
1844 |
+
[56Ni]
|
1845 |
+
[149Gd]
|
1846 |
+
[151Gd]
|
1847 |
+
[141Pm]
|
1848 |
+
[147Gd]
|
1849 |
+
[146Gd]
|
1850 |
+
[161Er]
|
1851 |
+
[103Ag]
|
1852 |
+
[145Eu]
|
1853 |
+
[153Tb]
|
1854 |
+
[155Dy]
|
1855 |
+
[184Re]
|
1856 |
+
[180Os]
|
1857 |
+
[182Os]
|
1858 |
+
[186Pt]
|
1859 |
+
[181Os]
|
1860 |
+
[181Re]
|
1861 |
+
[151Tb]
|
1862 |
+
[178Ta]
|
1863 |
+
[178W]
|
1864 |
+
[189Pt]
|
1865 |
+
[194Hg]
|
1866 |
+
[145Sm]
|
1867 |
+
[150Tb]
|
1868 |
+
[132La]
|
1869 |
+
[158Gd]
|
1870 |
+
[104Ag]
|
1871 |
+
[193Hg]
|
1872 |
+
[94Ru]
|
1873 |
+
[137Pr]
|
1874 |
+
[155Ho]
|
1875 |
+
[117Cd]
|
1876 |
+
[99Ru]
|
1877 |
+
[146Nd]
|
1878 |
+
[218Rn]
|
1879 |
+
[95Y]
|
1880 |
+
[79Kr]
|
1881 |
+
[120IH]
|
1882 |
+
[138Pr]
|
1883 |
+
[100Pd]
|
1884 |
+
[166Tm]
|
1885 |
+
[90Mo]
|
1886 |
+
[151Nd]
|
1887 |
+
[231U]
|
1888 |
+
[138Nd]
|
1889 |
+
[89Nb]
|
1890 |
+
[98Nb]
|
1891 |
+
[162Ho]
|
1892 |
+
[142Sm]
|
1893 |
+
[186Ta]
|
1894 |
+
[104Tc]
|
1895 |
+
[184Ta]
|
1896 |
+
[185Ta]
|
1897 |
+
[170Er]
|
1898 |
+
[107Rh]
|
1899 |
+
[131La]
|
1900 |
+
[169Lu]
|
1901 |
+
[74BrH]
|
1902 |
+
[150Pm]
|
1903 |
+
[172Tm]
|
1904 |
+
[197Pt]
|
1905 |
+
[230Pu]
|
1906 |
+
[170Lu]
|
1907 |
+
[86Zr]
|
1908 |
+
[176W]
|
1909 |
+
[177W]
|
1910 |
+
[101Pd]
|
1911 |
+
[105Pd]
|
1912 |
+
[108Pd]
|
1913 |
+
[149Nd]
|
1914 |
+
[164Ho]
|
1915 |
+
[159Ho]
|
1916 |
+
[167Ho]
|
1917 |
+
[176Yb]
|
1918 |
+
[156Sm]
|
1919 |
+
[77BrH]
|
1920 |
+
[189Re]
|
1921 |
+
[99Rh]
|
1922 |
+
[100Rh]
|
1923 |
+
[151Pm]
|
1924 |
+
[232Pa]
|
1925 |
+
[228Pa]
|
1926 |
+
[230Pa]
|
1927 |
+
[66Ni]
|
1928 |
+
[194Os]
|
1929 |
+
[135La]
|
1930 |
+
[138La]
|
1931 |
+
[141La]
|
1932 |
+
[142La]
|
1933 |
+
[195Ir]
|
1934 |
+
[96Nb]
|
1935 |
+
[157Ho]
|
1936 |
+
[183Hf]
|
1937 |
+
[162Tm]
|
1938 |
+
[172Er]
|
1939 |
+
[148Eu]
|
1940 |
+
[150Eu]
|
1941 |
+
[15CH4]
|
1942 |
+
[89Kr]
|
1943 |
+
[143La]
|
1944 |
+
[58Ni]
|
1945 |
+
[61Co]
|
1946 |
+
[158Eu]
|
1947 |
+
[165Er]
|
1948 |
+
[167Yb]
|
1949 |
+
[173Tm]
|
1950 |
+
[175Tm]
|
1951 |
+
[172Hf]
|
1952 |
+
[172Lu]
|
1953 |
+
[93Tc]
|
1954 |
+
[177Yb]
|
1955 |
+
[124IH]
|
1956 |
+
[194Ir]
|
1957 |
+
[147Eu]
|
1958 |
+
[101Mo]
|
1959 |
+
[180Hf]
|
1960 |
+
[189Ir]
|
1961 |
+
[87Y]
|
1962 |
+
[43Sc]
|
1963 |
+
[195Au]
|
1964 |
+
[112Ag]
|
1965 |
+
[84BrH]
|
1966 |
+
[106Ag]
|
1967 |
+
[109Ag]
|
1968 |
+
[101Rh]
|
1969 |
+
[162Yb]
|
1970 |
+
[228Rn]
|
1971 |
+
[139Pr]
|
1972 |
+
[94Y]
|
1973 |
+
[201Au]
|
1974 |
+
[40PH3]
|
1975 |
+
[110Ag+]
|
1976 |
+
[104Cd]
|
1977 |
+
[133Ba+2]
|
1978 |
+
[226Ac]
|
1979 |
+
[145Gd]
|
1980 |
+
[186Ir]
|
1981 |
+
[184Ir]
|
1982 |
+
[224Rn]
|
1983 |
+
[185Ir]
|
1984 |
+
[182Ir]
|
1985 |
+
[184Hf]
|
1986 |
+
[200Pt]
|
1987 |
+
[227Pa]
|
1988 |
+
[178Yb]
|
1989 |
+
[72Br-]
|
1990 |
+
[72BrH]
|
1991 |
+
[248Am]
|
1992 |
+
[238Th]
|
1993 |
+
[161Gd]
|
1994 |
+
[35S-2]
|
1995 |
+
[107Ag]
|
1996 |
+
[FeH6-4]
|
1997 |
+
[89Sr]
|
1998 |
+
[SnH3-]
|
1999 |
+
[SeH3]
|
2000 |
+
[TeH3+]
|
2001 |
+
[SbH4+]
|
2002 |
+
[AsH4+]
|
2003 |
+
[4He]
|
2004 |
+
[AsH3-]
|
2005 |
+
[1HH]
|
2006 |
+
[3H+]
|
2007 |
+
[82Rb]
|
2008 |
+
[85Sr]
|
2009 |
+
[90Sr]
|
2010 |
+
[137Cs]
|
2011 |
+
[133Ba]
|
2012 |
+
[131Cs]
|
2013 |
+
[SbH5]
|
2014 |
+
[224Ra]
|
2015 |
+
[22Na]
|
2016 |
+
[210Bi]
|
2017 |
+
[214Bi]
|
2018 |
+
[228Ra]
|
2019 |
+
[127Sb]
|
2020 |
+
[136Cs]
|
2021 |
+
[125Sb]
|
2022 |
+
[134Cs]
|
2023 |
+
[140Ba]
|
2024 |
+
[45Ca]
|
2025 |
+
[206Pb]
|
2026 |
+
[207Pb]
|
2027 |
+
[24Na]
|
2028 |
+
[86Rb]
|
2029 |
+
[212Bi]
|
2030 |
+
[208Pb]
|
2031 |
+
[124Sb]
|
2032 |
+
[204Pb]
|
2033 |
+
[44K]
|
2034 |
+
[129Te]
|
2035 |
+
[113Sn]
|
2036 |
+
[204Tl]
|
2037 |
+
[87Sr]
|
2038 |
+
[208Tl]
|
2039 |
+
[87Rb]
|
2040 |
+
[47Ca]
|
2041 |
+
[135Cs]
|
2042 |
+
[216Po]
|
2043 |
+
[137Ba]
|
2044 |
+
[207Bi]
|
2045 |
+
[212Po]
|
2046 |
+
[79Se]
|
2047 |
+
[223Ra]
|
2048 |
+
[86Sr]
|
2049 |
+
[122Sb]
|
2050 |
+
[26Al]
|
2051 |
+
[32Si]
|
2052 |
+
[126Sn]
|
2053 |
+
[225Ra]
|
2054 |
+
[114In]
|
2055 |
+
[72Ga]
|
2056 |
+
[132Te]
|
2057 |
+
[10Be]
|
2058 |
+
[125Sn]
|
2059 |
+
[73As]
|
2060 |
+
[206Bi]
|
2061 |
+
[117Sn]
|
2062 |
+
[40Ca]
|
2063 |
+
[41Ca]
|
2064 |
+
[89Rb]
|
2065 |
+
[116In]
|
2066 |
+
[129Sb]
|
2067 |
+
[91Sr]
|
2068 |
+
[71Ge]
|
2069 |
+
[139Ba]
|
2070 |
+
[69Ga]
|
2071 |
+
[120Sb]
|
2072 |
+
[121Sn]
|
2073 |
+
[123Sn]
|
2074 |
+
[131Te]
|
2075 |
+
[77Ge]
|
2076 |
+
[135Ba]
|
2077 |
+
[82Sr]
|
2078 |
+
[43K]
|
2079 |
+
[131Ba]
|
2080 |
+
[92Sr]
|
2081 |
+
[88Rb]
|
2082 |
+
[129Cs]
|
2083 |
+
[144Cs]
|
2084 |
+
[127Cs]
|
2085 |
+
[200Tl]
|
2086 |
+
[202Tl]
|
2087 |
+
[141Ba]
|
2088 |
+
[117Sb]
|
2089 |
+
[116Sb]
|
2090 |
+
[78As]
|
2091 |
+
[131Sb]
|
2092 |
+
[126Sb]
|
2093 |
+
[128Sb]
|
2094 |
+
[130Sb]
|
2095 |
+
[67Ge]
|
2096 |
+
[68Ge]
|
2097 |
+
[78Ge]
|
2098 |
+
[66Ge]
|
2099 |
+
[223Fr]
|
2100 |
+
[132Cs]
|
2101 |
+
[125Cs]
|
2102 |
+
[138Cs]
|
2103 |
+
[133Te]
|
2104 |
+
[84Rb]
|
2105 |
+
[83Rb]
|
2106 |
+
[81Rb]
|
2107 |
+
[142Ba]
|
2108 |
+
[200Bi]
|
2109 |
+
[115Sb]
|
2110 |
+
[194Tl]
|
2111 |
+
[70Se]
|
2112 |
+
[112In]
|
2113 |
+
[118Sb]
|
2114 |
+
[70Ga]
|
2115 |
+
[27Mg]
|
2116 |
+
[202Bi]
|
2117 |
+
[83Se]
|
2118 |
+
[9Li]
|
2119 |
+
[69As]
|
2120 |
+
[79Rb]
|
2121 |
+
[81Sr]
|
2122 |
+
[83Sr]
|
2123 |
+
[78Se]
|
2124 |
+
[109In]
|
2125 |
+
[29Al]
|
2126 |
+
[118Sn]
|
2127 |
+
[117In]
|
2128 |
+
[119Sb]
|
2129 |
+
[114Sn]
|
2130 |
+
[138Ba]
|
2131 |
+
[69Ge]
|
2132 |
+
[73Ga]
|
2133 |
+
[74Ge]
|
2134 |
+
[206Tl]
|
2135 |
+
[199Tl]
|
2136 |
+
[130Cs]
|
2137 |
+
[28Mg]
|
2138 |
+
[116Te]
|
2139 |
+
[112Sn]
|
2140 |
+
[126Ba]
|
2141 |
+
[211Bi]
|
2142 |
+
[81Se]
|
2143 |
+
[127Sn]
|
2144 |
+
[143Cs]
|
2145 |
+
[134Te]
|
2146 |
+
[80Sr]
|
2147 |
+
[45K]
|
2148 |
+
[215Po]
|
2149 |
+
[207Po]
|
2150 |
+
[111Sn]
|
2151 |
+
[211Po]
|
2152 |
+
[128Ba]
|
2153 |
+
[198Tl]
|
2154 |
+
[227Ra]
|
2155 |
+
[213Po]
|
2156 |
+
[220Ra]
|
2157 |
+
[128Sn]
|
2158 |
+
[203Po]
|
2159 |
+
[205Po]
|
2160 |
+
[65Ga]
|
2161 |
+
[197Tl]
|
2162 |
+
[88Sr]
|
2163 |
+
[110In]
|
2164 |
+
[31Si]
|
2165 |
+
[201Bi]
|
2166 |
+
[121Te]
|
2167 |
+
[205Bi]
|
2168 |
+
[203Bi]
|
2169 |
+
[195Tl]
|
2170 |
+
[209Tl]
|
2171 |
+
[110Sn]
|
2172 |
+
[222Fr]
|
2173 |
+
[207At]
|
2174 |
+
[119In]
|
2175 |
+
[As@]
|
2176 |
+
[129IH]
|
2177 |
+
[157Dy]
|
2178 |
+
[111IH]
|
2179 |
+
[230Ra]
|
2180 |
+
[144Pr+3]
|
2181 |
+
[SiH3+]
|
2182 |
+
[3He]
|
2183 |
+
[AsH5]
|
2184 |
+
[72Se]
|
2185 |
+
[95Tc]
|
2186 |
+
[103Pd]
|
2187 |
+
[121Sn+2]
|
2188 |
+
[211Rn]
|
2189 |
+
[38SH2]
|
2190 |
+
[127IH]
|
2191 |
+
[74Br-]
|
2192 |
+
[133I-]
|
2193 |
+
[100Tc+4]
|
2194 |
+
[100Tc]
|
2195 |
+
[36Cl-]
|
2196 |
+
[89Y+3]
|
2197 |
+
[104Rh]
|
2198 |
+
[152Sm]
|
2199 |
+
[226Ra]
|
2200 |
+
[19FH]
|
2201 |
+
[104Pd]
|
2202 |
+
[148Gd]
|
2203 |
+
[157Lu]
|
2204 |
+
[33SH2]
|
2205 |
+
[121I-]
|
2206 |
+
[17FH]
|
2207 |
+
[71Se]
|
2208 |
+
[157Sm]
|
2209 |
+
[148Tb]
|
2210 |
+
[164Dy]
|
2211 |
+
[15OH2]
|
2212 |
+
[15O+]
|
2213 |
+
[39K]
|
2214 |
+
[40Ar]
|
2215 |
+
[50Cr+3]
|
2216 |
+
[50Cr]
|
2217 |
+
[52Ti]
|
2218 |
+
[103Pd+2]
|
2219 |
+
[130Ba]
|
2220 |
+
[142Pm]
|
2221 |
+
[153Gd+3]
|
2222 |
+
[151Eu]
|
2223 |
+
[103Rh]
|
2224 |
+
[124Xe]
|
2225 |
+
[152Tb]
|
2226 |
+
[17OH2]
|
2227 |
+
[20Ne]
|
2228 |
+
[52Fe]
|
2229 |
+
[94Zr+4]
|
2230 |
+
[94Zr]
|
2231 |
+
[149Pr]
|
2232 |
+
[16OH2]
|
2233 |
+
[53Cr+6]
|
2234 |
+
[53Cr]
|
2235 |
+
[81Br-]
|
2236 |
+
[112Pd]
|
2237 |
+
[125Xe]
|
2238 |
+
[155Gd]
|
2239 |
+
[157Gd]
|
2240 |
+
[168Yb]
|
2241 |
+
[184Os]
|
2242 |
+
[166Tb]
|
2243 |
+
[221Fr]
|
2244 |
+
[212Ra]
|
2245 |
+
[75Br-]
|
2246 |
+
[79Br-]
|
2247 |
+
[113Ag]
|
2248 |
+
[23Na]
|
2249 |
+
[34Cl-]
|
2250 |
+
[34ClH]
|
2251 |
+
[38Cl-]
|
2252 |
+
[56Fe]
|
2253 |
+
[68Cu]
|
2254 |
+
[77Br-]
|
2255 |
+
[90Zr+4]
|
2256 |
+
[90Zr]
|
2257 |
+
[102Pd]
|
2258 |
+
[154Eu+3]
|
2259 |
+
[57Mn]
|
2260 |
+
[165Tm]
|
2261 |
+
[152Dy]
|
2262 |
+
[217At]
|
2263 |
+
[77se]
|
2264 |
+
[13cH-]
|
2265 |
+
[122Te]
|
2266 |
+
[156Gd]
|
2267 |
+
[124Te]
|
2268 |
+
[53Ni]
|
2269 |
+
[131Xe]
|
2270 |
+
[174Hf+4]
|
2271 |
+
[174Hf]
|
2272 |
+
[76Se]
|
2273 |
+
[168Tm]
|
2274 |
+
[167Dy]
|
2275 |
+
[154Gd]
|
2276 |
+
[95Ru]
|
2277 |
+
[210At]
|
2278 |
+
[85Br]
|
2279 |
+
[59Co]
|
2280 |
+
[122Xe]
|
2281 |
+
[27Al]
|
2282 |
+
[54Cr]
|
2283 |
+
[198Hg]
|
2284 |
+
[85Rb+]
|
2285 |
+
[214Tl]
|
2286 |
+
[229Rn]
|
2287 |
+
[218Pb]
|
2288 |
+
[218Bi]
|
2289 |
+
[167Tm+3]
|
2290 |
+
[18o+]
|
2291 |
+
[P@@H+]
|
2292 |
+
[P@H+]
|
2293 |
+
[13N+]
|
2294 |
+
[212Pb+2]
|
2295 |
+
[217Bi]
|
2296 |
+
[249Cf+2]
|
2297 |
+
[18OH3+]
|
2298 |
+
[90Sr-]
|
2299 |
+
[Cf+3]
|
2300 |
+
[200Hg]
|
2301 |
+
[86Tc]
|
2302 |
+
[141Pr+3]
|
2303 |
+
[141Pr]
|
2304 |
+
[16nH]
|
2305 |
+
[14NH4+]
|
2306 |
+
[132Xe]
|
2307 |
+
[83Kr]
|
2308 |
+
[70Zn+2]
|
2309 |
+
[137Ba+2]
|
2310 |
+
[36Ar]
|
2311 |
+
[38Ar]
|
2312 |
+
[21Ne]
|
2313 |
+
[126Xe]
|
2314 |
+
[136Xe]
|
2315 |
+
[128Xe]
|
2316 |
+
[134Xe]
|
2317 |
+
[84Kr]
|
2318 |
+
[86Kr]
|
2319 |
+
[78Kr]
|
2320 |
+
[80Kr]
|
2321 |
+
[82Kr]
|
2322 |
+
[67Zn+2]
|
2323 |
+
[65Cu+2]
|
2324 |
+
[110Te]
|
2325 |
+
[58Fe+3]
|
2326 |
+
[142Nd]
|
2327 |
+
[38K]
|
2328 |
+
[198Au+3]
|
2329 |
+
[122IH]
|
2330 |
+
[38PH3]
|
2331 |
+
[130I-]
|
2332 |
+
[40K+]
|
2333 |
+
[38K+]
|
2334 |
+
[28Mg+2]
|
2335 |
+
[208Tl+]
|
2336 |
+
[13OH2]
|
2337 |
+
[198Bi]
|
2338 |
+
[192Bi]
|
2339 |
+
[194Bi]
|
2340 |
+
[196Bi]
|
2341 |
+
[132I-]
|
2342 |
+
[83Sr+2]
|
2343 |
+
[169Er+3]
|
2344 |
+
[122I-]
|
2345 |
+
[120I-]
|
2346 |
+
[92Sr+2]
|
2347 |
+
[126I-]
|
2348 |
+
[24Mg]
|
2349 |
+
[84Sr]
|
2350 |
+
[118Pd+2]
|
2351 |
+
[118Pd]
|
2352 |
+
[AsH4]
|
2353 |
+
[127I-]
|
2354 |
+
[9C-]
|
2355 |
+
[11CH3+]
|
2356 |
+
[17B]
|
2357 |
+
[7B]
|
2358 |
+
[4HH]
|
2359 |
+
[18C-]
|
2360 |
+
[22CH3-]
|
2361 |
+
[22CH4]
|
2362 |
+
[17C-]
|
2363 |
+
[15CH3]
|
2364 |
+
[16CH3]
|
2365 |
+
[11NH3]
|
2366 |
+
[21NH3]
|
2367 |
+
[11N-]
|
2368 |
+
[11NH]
|
2369 |
+
[16CH]
|
2370 |
+
[17CH2]
|
2371 |
+
[99Ru+2]
|
2372 |
+
[181Ta+2]
|
2373 |
+
[181Ta]
|
2374 |
+
[20CH]
|
2375 |
+
[32PH2]
|
2376 |
+
[55Fe+2]
|
2377 |
+
[SH3]
|
2378 |
+
[S@H]
|
2379 |
+
[Mn-]
|
2380 |
+
[IH4]
|
2381 |
+
[ThH]
|
2382 |
+
[GaH-]
|
2383 |
+
[BiH+]
|
2384 |
+
[EuH2]
|
2385 |
+
[FeH4-3]
|
2386 |
+
[FeH6]
|
2387 |
+
[IH5]
|
2388 |
+
[NiH+]
|
2389 |
+
[SrH2]
|
2390 |
+
[VH4]
|
2391 |
+
[YH3]
|
2392 |
+
[seH+]
|
2393 |
+
<unk>
|
smi-ted/inference/smi_ted_light/load.py
ADDED
@@ -0,0 +1,642 @@
PATTERN = "(\[[^\]]+]|Br?|Cl?|N|O|S|P|F|I|b|c|n|o|s|p|\(|\)|\.|=|#|-|\+|\\\\|\/|:|~|@|\?|>|\*|\$|\%[0-9]{2}|[0-9])"

# Deep learning
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.backends.cudnn as cudnn

# Transformers
from fast_transformers.attention import AttentionLayer
from fast_transformers.events import QKVEvent
from fast_transformers.transformers import TransformerEncoder, TransformerEncoderLayer
from fast_transformers.builders.transformer_builders import BaseTransformerEncoderBuilder
from fast_transformers.builders.attention_builders import AttentionBuilder
from fast_transformers.feature_maps import GeneralizedRandomFeatures
from fast_transformers.masking import LengthMask
from transformers import BertTokenizer
from huggingface_hub import hf_hub_download

# Data
import numpy as np
import pandas as pd

# Chemistry
from rdkit import Chem
from rdkit.Chem import PandasTools
from rdkit.Chem import Descriptors
PandasTools.RenderImagesInAllDataFrames(True)

# Standard library
from functools import partial
import regex as re
import random
import os
import gc
from tqdm import tqdm
tqdm.pandas()


# function to canonicalize SMILES
def normalize_smiles(smi, canonical=True, isomeric=False):
    try:
        normalized = Chem.MolToSmiles(
            Chem.MolFromSmiles(smi), canonical=canonical, isomericSmiles=isomeric
        )
    except Exception:  # RDKit raises on unparseable SMILES; return None instead
        normalized = None
    return normalized

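For orientation, a quick sketch of the canonicalization helper above (made-up inputs; the output comments assume standard RDKit behavior):

```python
print(normalize_smiles('C1=CC=CC=C1'))   # 'c1ccccc1' -- RDKit canonical, aromatic form
print(normalize_smiles('not a smiles'))  # None -- parse failures are swallowed
```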
class MolTranBertTokenizer(BertTokenizer):
    def __init__(self, vocab_file: str = '',
                 do_lower_case=False,
                 unk_token='<pad>',
                 sep_token='<eos>',
                 pad_token='<pad>',
                 cls_token='<bos>',
                 mask_token='<mask>',
                 **kwargs):
        super().__init__(vocab_file,
                         unk_token=unk_token,
                         sep_token=sep_token,
                         pad_token=pad_token,
                         cls_token=cls_token,
                         mask_token=mask_token,
                         **kwargs)

        self.regex_tokenizer = re.compile(PATTERN)
        self.wordpiece_tokenizer = None
        self.basic_tokenizer = None
        with open(vocab_file) as f:
            self.padding_idx = f.readlines().index(pad_token + '\n')

    def _tokenize(self, text):
        split_tokens = self.regex_tokenizer.findall(text)
        return split_tokens

    def convert_idx_to_tokens(self, idx_tensor):
        tokens = [self.convert_ids_to_tokens(idx) for idx in idx_tensor.tolist()]
        return tokens

    def convert_tokens_to_string(self, tokens):
        stopwords = ['<bos>', '<eos>']
        clean_tokens = [word for word in tokens if word not in stopwords]
        out_string = ''.join(clean_tokens)
        return out_string

    def get_padding_idx(self):
        return self.padding_idx

    def idx_to_smiles(self, torch_model, idx):
        '''Convert token indices back to SMILES text.'''
        rev_tokens = torch_model.tokenizer.convert_idx_to_tokens(idx)
        flat_list_tokens = [item for sublist in rev_tokens for item in sublist]
        decoded_smiles = torch_model.tokenizer.convert_tokens_to_string(flat_list_tokens)
        return decoded_smiles

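As a sanity check on the tokenizer, a minimal sketch (illustrative input, not from the repo) of how `_tokenize` splits a SMILES string with the atom-level PATTERN regex defined above; bracket atoms, including the isotope tokens listed in the vocabulary file, come out as single tokens:

```python
import regex as re

tokens = re.compile(PATTERN).findall('CC(=O)O[13CH3+]')
print(tokens)  # ['C', 'C', '(', '=', 'O', ')', 'O', '[13CH3+]']
```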
## Transformer layers
class RotaryEmbedding(torch.nn.Module):

    def __init__(self, dim, base=10000):
        super().__init__()
        inv_freq = 1. / (base ** (torch.arange(0, dim, 2).float() / dim))
        self.register_buffer('inv_freq', inv_freq)
        self.seq_len_cached = 0
        self.cos_cached = None
        self.sin_cached = None

    def forward(self, x, seq_dim=1):
        seq_len = x.shape[seq_dim]
        if seq_len != self.seq_len_cached:
            self.seq_len_cached = seq_len

            t = torch.arange(x.shape[seq_dim], device=x.device).type_as(self.inv_freq)
            freqs = torch.einsum('i,j->ij', t, self.inv_freq)
            emb = torch.cat((freqs, freqs), dim=-1).to(x.device)

            self.cos_cached = emb.cos()[None, :, None, :]
            self.sin_cached = emb.sin()[None, :, None, :]

        return self.cos_cached, self.sin_cached


def rotate_half(x):
    x1, x2 = x[..., :x.shape[-1] // 2], x[..., x.shape[-1] // 2:]
    return torch.cat((-x2, x1), dim=x1.ndim - 1)  # dim=-1 triggers a bug in earlier torch versions


@torch.jit.script
def apply_rotary_pos_emb(q, k, cos, sin):
    return (q * cos) + (rotate_half(q) * sin), (k * cos) + (rotate_half(k) * sin)


class RotateAttentionLayer(AttentionLayer):
    """Rotate attention layer, inheriting from the fast_transformers attention layer.
    The only addition is the rotary position embedding; for more information
    on the attention layer itself see the fast_transformers code.
    """
    def __init__(self, attention, d_model, n_heads, d_keys=None,
                 d_values=None, event_dispatcher=""):
        super(RotateAttentionLayer, self).__init__(attention, d_model, n_heads, d_keys=d_keys,
                                                   d_values=d_values, event_dispatcher=event_dispatcher)

        self.rotaryemb = RotaryEmbedding(d_keys)
        print('Using Rotation Embedding')

    def forward(self, queries, keys, values, attn_mask, query_lengths,
                key_lengths):
        """
        Uses the same framework as the fast_transformers attention layer,
        but injects rotary information into the queries and the keys
        after they are projected.
        In the argument description we make use of the following sizes
            - N: the batch size
            - L: the maximum length of the queries
            - S: the maximum length of the keys (the actual length per sequence
              is given by the length mask)
            - D: the input feature dimensionality passed in the constructor as
              'd_model'
        Arguments
        ---------
            queries: (N, L, D) The tensor containing the queries
            keys: (N, S, D) The tensor containing the keys
            values: (N, S, D) The tensor containing the values
            attn_mask: An implementation of BaseMask that encodes where each
                       query can attend to
            query_lengths: An implementation of BaseMask that encodes how
                           many queries each sequence in the batch consists of
            key_lengths: An implementation of BaseMask that encodes how
                         many keys each sequence in the batch consists of
        Returns
        -------
            The new value for each query as a tensor of shape (N, L, D).
        """
        # Extract the dimensions into local variables
        N, L, _ = queries.shape
        _, S, _ = keys.shape
        H = self.n_heads

        # Project the queries/keys/values
        queries = self.query_projection(queries).view(N, L, H, -1)
        keys = self.key_projection(keys).view(N, S, H, -1)
        cos, sin = self.rotaryemb(queries)
        queries, keys = apply_rotary_pos_emb(queries, keys, cos, sin)
        values = self.value_projection(values).view(N, S, H, -1)
        # Let the world know of the qkv
        self.event_dispatcher.dispatch(QKVEvent(self, queries, keys, values))

        # Compute the attention
        new_values = self.inner_attention(
            queries,
            keys,
            values,
            attn_mask,
            query_lengths,
            key_lengths
        ).view(N, L, -1)

        # Project the output and return
        return self.out_projection(new_values)

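A minimal shape check for the rotary helpers above (illustrative sizes, not from the repo): the rotation is applied per head and per position, and leaves the tensor shapes unchanged.

```python
import torch

rot = RotaryEmbedding(dim=64)        # 64 dims per attention head (illustrative)
q = torch.randn(2, 10, 8, 64)        # (N, L, H, D), as produced by the projections
k = torch.randn(2, 10, 8, 64)
cos, sin = rot(q)                    # each (1, L, 1, D), cached per sequence length
q_rot, k_rot = apply_rotary_pos_emb(q, k, cos, sin)
print(q_rot.shape, k_rot.shape)      # torch.Size([2, 10, 8, 64]) twice
```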
class RotateEncoderBuilder(BaseTransformerEncoderBuilder):
    """Build a batch transformer encoder with relative rotary embeddings
    for training or processing of sequences, all elements at a time.
    Example usage:
        builder = RotateEncoderBuilder()
        builder.n_layers = 12
        builder.n_heads = 8
        builder.feed_forward_dimensions = 1024
        builder.query_dimensions = 64
        builder.value_dimensions = 64
        builder.dropout = 0.1
        builder.attention_dropout = 0.1
        builder.attention_type = "linear"
        transformer = builder.get()
    """
    def _get_attention_builder(self):
        """Return an instance of the appropriate attention builder."""
        return AttentionBuilder()

    def _get_attention_layer_class(self):
        """Return the class for the layer that projects queries, keys and
        values."""
        return RotateAttentionLayer

    def _get_encoder_class(self):
        """Return the class for the transformer encoder."""
        return TransformerEncoder

    def _get_encoder_layer_class(self):
        """Return the class for the transformer encoder layer."""
        return TransformerEncoderLayer


class AutoEncoderLayer(nn.Module):

    def __init__(self, feature_size, latent_size):
        super().__init__()
        self.encoder = self.Encoder(feature_size, latent_size)
        self.decoder = self.Decoder(feature_size, latent_size)

    class Encoder(nn.Module):

        def __init__(self, feature_size, latent_size):
            super().__init__()
            self.is_cuda_available = torch.cuda.is_available()
            self.fc1 = nn.Linear(feature_size, latent_size)
            self.ln_f = nn.LayerNorm(latent_size)
            self.lat = nn.Linear(latent_size, latent_size, bias=False)

        def forward(self, x):
            if self.is_cuda_available:
                self.fc1.cuda()
                self.ln_f.cuda()
                self.lat.cuda()
                x = x.cuda()
            x = F.gelu(self.fc1(x))
            x = self.ln_f(x)
            x = self.lat(x)
            return x  # -> (N, D)

    class Decoder(nn.Module):

        def __init__(self, feature_size, latent_size):
            super().__init__()
            self.is_cuda_available = torch.cuda.is_available()
            self.fc1 = nn.Linear(latent_size, latent_size)
            self.ln_f = nn.LayerNorm(latent_size)
            self.rec = nn.Linear(latent_size, feature_size, bias=False)

        def forward(self, x):
            if self.is_cuda_available:
                self.fc1.cuda()
                self.ln_f.cuda()
                self.rec.cuda()
                x = x.cuda()
            x = F.gelu(self.fc1(x))
            x = self.ln_f(x)
            x = self.rec(x)
            return x  # -> (N, L*D)

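A small round-trip sketch of the autoencoder above (illustrative sizes; in practice max_len and n_embd come from the checkpoint config): it compresses the flattened token-embedding matrix into one latent vector per molecule and reconstructs it.

```python
import torch

max_len, n_embd = 202, 768                # illustrative sizes
ae = AutoEncoderLayer(feature_size=max_len * n_embd, latent_size=n_embd)
flat = torch.randn(4, max_len * n_embd)   # (N, max_len * n_embd) flattened token embeddings
latent = ae.encoder(flat)                 # (N, n_embd) -- the SMILES embedding
recon = ae.decoder(latent)                # (N, max_len * n_embd) reconstruction
print(latent.shape, recon.shape)
```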
class LangLayer(nn.Module):

    def __init__(self, n_embd, n_vocab):
        super().__init__()
        self.is_cuda_available = torch.cuda.is_available()
        self.embed = nn.Linear(n_embd, n_embd)
        self.ln_f = nn.LayerNorm(n_embd)
        self.head = nn.Linear(n_embd, n_vocab, bias=False)

    def forward(self, tensor):
        if self.is_cuda_available:
            self.embed.cuda()
            self.ln_f.cuda()
            self.head.cuda()
            tensor = tensor.cuda()
        tensor = self.embed(tensor)
        tensor = F.gelu(tensor)
        tensor = self.ln_f(tensor)
        tensor = self.head(tensor)
        return tensor


class Net(nn.Module):

    def __init__(self, smiles_embed_dim, n_output=1, dropout=0.2):
        super().__init__()
        self.desc_skip_connection = True
        self.fc1 = nn.Linear(smiles_embed_dim, smiles_embed_dim)
        self.dropout1 = nn.Dropout(dropout)
        self.relu1 = nn.GELU()
        self.fc2 = nn.Linear(smiles_embed_dim, smiles_embed_dim)
        self.dropout2 = nn.Dropout(dropout)
        self.relu2 = nn.GELU()
        self.final = nn.Linear(smiles_embed_dim, n_output)

    def forward(self, smiles_emb, multitask=False):
        x_out = self.fc1(smiles_emb)
        x_out = self.dropout1(x_out)
        x_out = self.relu1(x_out)

        if self.desc_skip_connection is True:
            x_out = x_out + smiles_emb

        z = self.fc2(x_out)
        z = self.dropout2(z)
        z = self.relu2(z)
        if self.desc_skip_connection is True:
            z = self.final(z + x_out)
        else:
            z = self.final(z)

        if multitask:
            return F.sigmoid(z)
        return z

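A sketch of driving the prediction head above (illustrative width): two GELU blocks with residual connections map a pooled SMILES embedding to n_output values, optionally sigmoid-squashed for multitask classification.

```python
import torch

head = Net(smiles_embed_dim=768, n_output=1)    # illustrative width
y = head(torch.randn(4, 768))                   # (4, 1) raw regression outputs
p = head(torch.randn(4, 768), multitask=True)   # outputs squashed into [0, 1]
print(y.shape, p.shape)
```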
class MoLEncoder(nn.Module):

    def __init__(self, config, n_vocab):
        super(MoLEncoder, self).__init__()

        # embeddings
        self.config = config
        self.tok_emb = nn.Embedding(n_vocab, config['n_embd'])
        self.drop = nn.Dropout(config['d_dropout'])

        # transformer
        builder = RotateEncoderBuilder.from_kwargs(
            n_layers=config['n_layer'],
            n_heads=config['n_head'],
            query_dimensions=config['n_embd'] // config['n_head'],
            value_dimensions=config['n_embd'] // config['n_head'],
            feed_forward_dimensions=config['n_embd'],
            attention_type='linear',
            # unless we use deterministic_eval here, we will get random outputs
            feature_map=partial(GeneralizedRandomFeatures,
                                n_dims=config['num_feats'],
                                deterministic_eval=True),
            activation='gelu'
        )
        self.blocks = builder.get()

        # classification
        self.lang_model = LangLayer(config['n_embd'], n_vocab)

    def forward(self, idx, mask):
        # transformer encoder
        x = self.tok_emb(idx)  # each index maps to a (learnable) vector
        x = self.drop(x)
        x = self.blocks(x, length_mask=LengthMask(mask.sum(-1), max_len=idx.shape[1]))

        # add padding
        token_embeddings = x
        input_mask_expanded = mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        mask_embeddings = (token_embeddings * input_mask_expanded)
        token_embeddings = F.pad(mask_embeddings, pad=(0, 0, 0, self.config['max_len'] - mask_embeddings.shape[1]), value=0)

        return token_embeddings


class MoLDecoder(nn.Module):

    def __init__(self, n_vocab, max_len, n_embd, n_gpu=None):
        super(MoLDecoder, self).__init__()

        self.max_len = max_len
        self.n_embd = n_embd
        self.n_gpu = n_gpu
        self.autoencoder = AutoEncoderLayer(n_embd * max_len, n_embd)
        self.lang_model = LangLayer(n_embd, n_vocab)


class Smi_ted(nn.Module):
    """materials.smi-ted-Light 289M Parameters"""

    def __init__(self, tokenizer, config=None):
        super(Smi_ted, self).__init__()

        # configuration
        self.config = config
        self.tokenizer = tokenizer
        self.padding_idx = tokenizer.get_padding_idx()
        self.n_vocab = len(self.tokenizer.vocab)
        self.is_cuda_available = torch.cuda.is_available()

        # instantiate modules
        if self.config:
            self.encoder = MoLEncoder(self.config, self.n_vocab)
            self.decoder = MoLDecoder(self.n_vocab, self.config['max_len'], self.config['n_embd'])
            self.net = Net(self.config['n_embd'], n_output=self.config['n_output'], dropout=self.config['d_dropout'])

    def load_checkpoint(self, ckpt_path):
        # load checkpoint file
        checkpoint = torch.load(ckpt_path, map_location=torch.device('cpu'))

        # load hyperparameters
        self.config = checkpoint['hparams']
        self.max_len = self.config['max_len']
        self.n_embd = self.config['n_embd']
        self._set_seed(self.config['seed'])

        # instantiate modules
        self.encoder = MoLEncoder(self.config, self.n_vocab)
        self.decoder = MoLDecoder(self.n_vocab, self.max_len, self.n_embd)
        self.net = Net(self.n_embd, n_output=self.config['n_output'] if 'n_output' in self.config else 1, dropout=self.config['d_dropout'])

        # load weights
        if 'state_dict' in checkpoint:
            if isinstance(checkpoint['state_dict'], list):
                self.encoder.load_state_dict(checkpoint['state_dict'][0], strict=False)
                self.decoder.load_state_dict(checkpoint['state_dict'][1], strict=False)
            else:
                self.load_state_dict(checkpoint['state_dict'], strict=False)
        elif 'MODEL_STATE' in checkpoint:
            self.load_state_dict(checkpoint['MODEL_STATE'], strict=False)

        # restore RNG states each time the model and states are loaded from checkpoint
        if 'rng' in self.config:
            rng = self.config['rng']
            for key, value in rng.items():
                if key == 'torch_state':
                    torch.set_rng_state(value.cpu())
                elif key == 'cuda_state':
                    torch.cuda.set_rng_state(value.cpu())
                elif key == 'numpy_state':
                    np.random.set_state(value)
                elif key == 'python_state':
                    random.setstate(value)
                else:
                    print('unrecognized state')

    def _set_seed(self, value):
        print('Random Seed:', value)
        random.seed(value)
        torch.manual_seed(value)
        torch.cuda.manual_seed(value)
        torch.cuda.manual_seed_all(value)
        np.random.seed(value)
        cudnn.deterministic = True
        cudnn.benchmark = False

    def forward(self, smiles, batch_size=100):
        return self.decode(self.encode(smiles, batch_size=batch_size, return_torch=True))

    def tokenize(self, smiles):
        """Tokenize a string into tokens."""
        if isinstance(smiles, str):
            batch = [smiles]
        else:
            batch = smiles

        tokens = self.tokenizer(
            batch,
            padding=True,
            truncation=True,
            add_special_tokens=True,
            return_tensors="pt",
            max_length=self.max_len,
        )

        idx = tokens['input_ids'].clone().detach()
        mask = tokens['attention_mask'].clone().detach()

        if self.is_cuda_available:
            return idx.cuda(), mask.cuda()

        return idx, mask

    def extract_all(self, smiles):
        """Extract all elements from each part of smi-ted. Be careful."""
        # evaluation mode
        self.encoder.eval()
        self.decoder.eval()
        if self.is_cuda_available:
            self.encoder.cuda()
            self.decoder.cuda()

        # handle a single str or a list of str
        smiles = pd.Series(smiles) if isinstance(smiles, str) else pd.Series(list(smiles))
        smiles = smiles.apply(normalize_smiles)

        # tokenizer
        idx, mask = self.tokenize(smiles.to_list())

        ###########
        # Encoder #
        ###########
        # encoder forward
        x = self.encoder.tok_emb(idx)  # each index maps to a (learnable) vector
        x = self.encoder.drop(x)
        x = self.encoder.blocks(x, length_mask=LengthMask(mask.sum(-1)))

        # mean pooling
        token_embeddings = x
        input_mask_expanded = mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
        sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
        true_set = sum_embeddings / sum_mask  # DO NOT USE THIS FOR DOWNSTREAM TASKS, USE `pred_set` INSTEAD

        # add padding
        mask_embeddings = (token_embeddings * input_mask_expanded)
        token_embeddings = F.pad(mask_embeddings, pad=(0, 0, 0, self.max_len - mask_embeddings.shape[1]), value=0)
        idx = F.pad(idx, pad=(0, self.max_len - idx.shape[1], 0, 0), value=2)

        true_ids = idx
        true_cte = token_embeddings
        true_cte = true_cte.view(-1, self.max_len * self.n_embd)

        ###########
        # Decoder #
        ###########
        # CTE autoencoder
        pred_set = self.decoder.autoencoder.encoder(true_cte)
        pred_cte = self.decoder.autoencoder.decoder(pred_set)

        # reconstruct tokens
        pred_ids = self.decoder.lang_model(pred_cte.view(-1, self.max_len, self.n_embd))
        pred_ids = torch.argmax(pred_ids, axis=-1)

        return ((true_ids, pred_ids),  # tokens
                (true_cte, pred_cte),  # token embeddings
                (true_set, pred_set))  # smiles embeddings

    def extract_embeddings(self, smiles):
        """Extract token and SMILES embeddings."""
        # evaluation mode
        self.encoder.eval()
        if self.is_cuda_available:
            self.encoder.cuda()

        # tokenizer
        idx, mask = self.tokenize(smiles)

        # encoder forward
        token_embeddings = self.encoder(idx, mask)

        # aggregate token embeddings (similar to mean pooling)
        # CAUTION: use the embeddings from the autoencoder.
        smiles_embeddings = self.decoder.autoencoder.encoder(token_embeddings.view(-1, self.max_len * self.n_embd))

        # add padding
        idx = F.pad(idx, pad=(0, self.max_len - idx.shape[1], 0, 0), value=self.padding_idx)

        return idx, token_embeddings, smiles_embeddings

    def encode(self, smiles, useCuda=False, batch_size=100, return_torch=False):
        """Efficiently extract SMILES embeddings in batches."""
        # TODO: remove useCuda argument

        # handle a single str or a list of str
        smiles = pd.Series(smiles) if isinstance(smiles, str) else pd.Series(list(smiles))
        smiles = smiles.apply(normalize_smiles)
        n_split = smiles.shape[0] // batch_size if smiles.shape[0] >= batch_size else smiles.shape[0]

        # process in batches
        embeddings = [
            self.extract_embeddings(list(batch))[2].cpu().detach().numpy()
            for batch in tqdm(np.array_split(smiles, n_split))
        ]
        flat_list = [item for sublist in embeddings for item in sublist]

        # clear GPU memory
        if self.is_cuda_available:
            torch.cuda.empty_cache()
            gc.collect()

        if return_torch:
            return torch.tensor(np.array(flat_list))
        return pd.DataFrame(flat_list)

    def decode(self, smiles_embeddings):
        """Decode SMILES embeddings back to SMILES."""
        # evaluation mode
        self.decoder.eval()
        if self.is_cuda_available:
            self.decoder.cuda()

        # reconstruct token embeddings
        pred_token_embds = self.decoder.autoencoder.decoder(smiles_embeddings)

        # reconstruct tokens
        pred_idx = self.decoder.lang_model(pred_token_embds.view(-1, self.max_len, self.n_embd))
        pred_idx = torch.argmax(pred_idx, axis=-1).cpu().detach().numpy()

        # convert idx to tokens
        pred_smiles = []
        for i in range(pred_idx.shape[0]):
            idx = pred_idx[i]
            smiles = self.tokenizer.idx_to_smiles(self, idx)
            smiles = smiles.replace('<bos>', '')  # begin token
            smiles = smiles.replace('<eos>', '')  # end token
            smiles = smiles.replace('<pad>', '')  # pad token
            pred_smiles.append(smiles)

        # clear GPU memory
        if self.is_cuda_available:
            torch.cuda.empty_cache()
            gc.collect()

        return pred_smiles

    def __str__(self):
        return 'smi-ted-Light'

def load_smi_ted(folder="./smi_ted_light",
                 ckpt_filename="smi-ted-Light_40.pt",
                 vocab_filename="bert_vocab_curated.txt"
                 ):
    # NOTE: the vocab file and checkpoint are fetched from the Hugging Face Hub
    # below, which supersedes the `folder` and filename arguments above.
    repo_id = "ibm/materials.smi-ted"
    filename = "bert_vocab_curated.txt"
    vocab_filename = hf_hub_download(repo_id=repo_id, filename=filename)
    tokenizer = MolTranBertTokenizer(vocab_filename)
    model = Smi_ted(tokenizer)

    filename = "smi-ted-Light_40.pt"
    file_path = hf_hub_download(repo_id=repo_id, filename=filename)
    model.load_checkpoint(file_path)
    model.eval()
    print('Vocab size:', len(tokenizer.vocab))
    print(f'[INFERENCE MODE - {str(model)}]')
    return model
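Putting it together, a hedged end-to-end usage sketch of this module (function and method names as defined above; the printed values are illustrative):

```python
import torch

model = load_smi_ted()  # fetches vocab + checkpoint from ibm/materials.smi-ted

with torch.no_grad():
    embeddings = model.encode(['CCO', 'c1ccccc1'], return_torch=True)  # (2, n_embd)
    reconstructed = model.decode(embeddings)

print(embeddings.shape)
print(reconstructed)  # ideally ['CCO', 'c1ccccc1'] again
```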