
Downloading Pre-trained Weights

To download the ProteoScribe epoch-20 pre-trained weights as a .bin file from Google Drive, install gdown and run:

pip install gdown
gdown --id 1c3CwvbOP_kp3FpLL1wPrjO6qtY-XiT26 -O BioM3_ProteoScribe_pfam_epoch20_v1.bin
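
If you prefer to script the download, gdown also exposes a Python API; the sketch below fetches the same file ID via the standard uc?id= URL form:

import gdown

# Same Google Drive file ID as the CLI command above
url = "https://drive.google.com/uc?id=1c3CwvbOP_kp3FpLL1wPrjO6qtY-XiT26"
gdown.download(url, "BioM3_ProteoScribe_pfam_epoch20_v1.bin", quiet=False)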

Usage

Once downloaded, the pre-trained weights can be loaded as follows:

import json
import torch
import torch.nn as nn
from argparse import Namespace
import Stage3_source.cond_diff_transformer_layer as Stage3_mod

# Step 1: Load JSON Configuration
def load_json_config(json_path):
    """
    Load a JSON configuration file and return it as a dictionary.
    """
    with open(json_path, "r") as f:
        config = json.load(f)
    return config

# Step 2: Convert JSON Dictionary to Namespace
def convert_to_namespace(config_dict):
    """
    Recursively convert a dictionary to an argparse Namespace.
    """
    for key, value in config_dict.items():
        if isinstance(value, dict):
            config_dict[key] = convert_to_namespace(value)
    return Namespace(**config_dict)

# Step 3: Model Loading Function
def prepare_model(model_path, config_args) -> nn.Module:
    """
    Initialize and load the ProteoScribe model with pre-trained weights.
    """
    # Initialize the model graph
    model = Stage3_mod.get_model(
        args=config_args,
        data_shape=(config_args.image_size, config_args.image_size),
        num_classes=config_args.num_classes
    )

    # Load pre-trained weights and move the model to the target device
    model.load_state_dict(torch.load(model_path, map_location=config_args.device))
    model.to(config_args.device)
    model.eval()
    
    return model

if __name__ == '__main__':
    # Path to configuration and weights
    config_path = "stage3_config.json"
    model_weights_path = "weights/ProteoScribe/BioM3_ProteoScribe_pfam_epoch20_v1.bin"

    # Load Configuration
    print("Loading configuration...")
    config_dict = load_json_config(config_path)
    config_args = convert_to_namespace(config_dict)
    
    # Set device if not specified in config
    if not hasattr(config_args, 'device'):
        config_args.device = 'cuda' if torch.cuda.is_available() else 'cpu'

    # Load Model
    print("Loading pre-trained model weights...")
    model = prepare_model(model_weights_path, config_args)
    print(f"Model loaded successfully with weights! (Device: {config_args.device})")

Model Structure

The ProteoScribe model is a conditional diffusion transformer that generates protein sequences conditioned on facilitated embeddings. The model consists of:

  1. A transformer-based architecture for sequence generation
  2. Conditional diffusion layers for embedding processing
  3. Output layers for amino acid sequence prediction
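
Once the weights are loaded, a quick sanity check is to count parameters and list the top-level modules. This uses only standard PyTorch calls on the model returned by prepare_model:

# Quick structural sanity check on the loaded model
num_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {num_params:,}")
for name, module in model.named_children():
    print(f"{name} -> {module.__class__.__name__}")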

Configuration Requirements

The stage3_config.json file should contain the following key parameters:

{
    "image_size": [required_size],
    "num_classes": [num_amino_acids],
    "device": "cuda"
}

"device" may be "cuda" or "cpu" (standard JSON does not allow inline comments, so it is described here rather than annotated in the file). Any additional model-specific parameters expected by Stage3_source belong in the same file.
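
A missing key otherwise surfaces only as an AttributeError deep inside get_model, so it can help to validate the configuration up front. A minimal sketch, reusing load_json_config from the Usage section; the required-key list is an assumption based on the loader above (device is optional, since the script falls back to auto-detection):

REQUIRED_KEYS = ("image_size", "num_classes")  # assumed minimum, per the loader above

def validate_config(config_dict):
    """Fail fast with a clear error if a required key is missing."""
    missing = [k for k in REQUIRED_KEYS if k not in config_dict]
    if missing:
        raise KeyError(f"stage3_config.json is missing required keys: {missing}")

validate_config(load_json_config("stage3_config.json"))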

Dependencies

Ensure you have the following dependencies installed:

  • PyTorch (latest stable release)
  • gdown (used above to download the weights from Google Drive)
  • Stage3_source module (included in the BioM3 repository)
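
Assuming a pip-based environment, the external packages can be installed in one step (Stage3_source ships with the repository and needs no separate install):

pip install torch gdown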

Important Notes

  1. The model expects facilitated embeddings (z_c) as input, typically generated by Stage 2 (Facilitator); a device-staging sketch follows this list
  2. Model weights are optimized for protein sequence generation tasks
  3. Use CUDA-enabled GPU for optimal performance (if available)
  4. Default configuration is tuned for the Pfam database
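
As a concrete illustration of notes 1 and 3, the sketch below stages a batch of facilitated embeddings on the configured device. The shapes are placeholders: the true embedding dimension comes from your Stage 2 output, and the actual sampling entry point lives in Stage3_source.

import torch

# Hypothetical batch of 4 facilitated embeddings with placeholder dimension 512;
# replace 512 with the embedding size produced by Stage 2 (Facilitator).
z_c = torch.randn(4, 512)

# Keep inputs on the same device as the model (prepare_model already moves the model).
device = 'cuda' if torch.cuda.is_available() else 'cpu'
z_c = z_c.to(device)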

Troubleshooting

Common issues and solutions:

  1. CUDA Out of Memory

    • Reduce batch size in configuration
    • Use CPU if GPU memory is insufficient
  2. Module Import Errors

    • Ensure Stage3_source is in Python path
    • Check all dependencies are installed
  3. Weight Loading Issues

    • Verify the downloaded weights file is complete
    • Check that the model configuration matches the pre-trained architecture (a key-comparison sketch follows this list)
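
For weight-loading issues in particular, comparing the checkpoint's parameter names against a freshly initialized model usually pinpoints a mismatch. A minimal sketch, reusing the weight path from the Usage section; model here is the network returned by Stage3_mod.get_model before load_state_dict is called:

import torch

state = torch.load(
    "weights/ProteoScribe/BioM3_ProteoScribe_pfam_epoch20_v1.bin",
    map_location="cpu",
)

# Any symmetric difference in key names indicates a configuration/architecture mismatch.
missing = set(model.state_dict().keys()) - set(state.keys())
unexpected = set(state.keys()) - set(model.state_dict().keys())
print(f"Missing keys: {sorted(missing)}")
print(f"Unexpected keys: {sorted(unexpected)}")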

For additional support or issues:

  • Open an issue in the BioM3 repository
  • Check the documentation for updates

Citation

If you use these weights in your research, please cite:

Natural Language Prompts Guide the Design of Novel Functional Protein Sequences
bioRxiv 2024.11.11.622734
doi: https://doi.org/10.1101/2024.11.11.622734

Repository maintained by the BioM3 Team