ColonGPT (A colonoscopy-specific multimodal Language Model)


The Gradio Web UI allows you to use our examples or upload your images for inference.

📖 Paper | 🏠 Home

These are the merged weights of ColonGPT-v1-phi1.5-siglip-lora-stg2, comprising the vision encoder (SigLIP), the language model (Phi-1.5), and the other weights fine-tuned on our ColonINST.

ColonGPT is a standard multimodal language model with four basic components: a language tokenizer, a visual encoder (🤗 SigLIP-SO), a multimodal connector, and a language model (🤗 Phi1.5). This Hugging Face page provides a quick start for the convenience of new users. For further details about ColonGPT, we highly recommend visiting our homepage, where you will find comprehensive usage instructions for our model and the latest advancements in intelligent colonoscopy technology.

Quick start

Here is a code snippet showing how to quickly try out ColonGPT with transformers. The model focuses on three downstream tasks: image classification (CLS), referring expression generation (REG), and referring expression comprehension (REC). If you need a caption generator, please refer to ColonGPT-V1-stg1. For convenience, we manually combined some configuration and code files and merged the weights. Please note that this is only a quick-start script; we recommend installing ColonGPT's source code to explore further.

  • Before running the snippet, install the following minimum dependencies.

    conda create -n quickstart python=3.10
    conda activate quickstart
    pip install torch transformers accelerate pillow
    
  • Then you can use python script/quick_start/quickstart.py to start.

    import torch
    import transformers
    from transformers import AutoModelForCausalLM, AutoTokenizer, StoppingCriteria
    from PIL import Image
    import warnings
    
    transformers.logging.set_verbosity_error()
    transformers.logging.disable_progress_bar()
    warnings.filterwarnings('ignore')
    
    device = 'cuda'  # or cpu
    torch.set_default_device(device)
    
    model_name = "ai4colonoscopy/ColonGPT-v1"
    
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=torch.float16,  # or float32 for cpu
        device_map='auto',
        trust_remote_code=True
    )
    
    tokenizer = AutoTokenizer.from_pretrained(
        model_name,
        trust_remote_code=True
    )
    
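    class KeywordsStoppingCriteria(StoppingCriteria):
        """Stop generation once any token id of the keyword shows up at the end of the output."""
        def __init__(self, keyword, tokenizer, input_ids):
            self.keyword_id = tokenizer(keyword).input_ids  # token ids of the stop keyword
            self.tokenizer = tokenizer
            self.start_len = input_ids.shape[1]  # prompt length in tokens
    
        def __call__(self, input_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
            # Only the last len(self.keyword_id) tokens of the first sequence are inspected.
            for keyword_id in self.keyword_id:
                if keyword_id in input_ids[0, -len(self.keyword_id):]:
                    return True
            return False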
    
    # Build the prompt; <image> marks where the image is inserted.
    prompt = "Categorize the object."
    text = f"USER: <image>\n{prompt} ASSISTANT:"
    # Tokenize the text on each side of <image> and join the two halves with the id -200.
    text_chunks = [tokenizer(chunk).input_ids for chunk in text.split('<image>')]
    input_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1], dtype=torch.long).unsqueeze(0).to(device)
    
    # Replace this path with the location of your own image.
    image = Image.open('/home/projects/u7248002/Project/ColonGPT-tmp/cache/examples/example2.png')
    image_tensor = model.process_images([image], model.config).to(dtype=model.dtype, device=device)
    
    stop_str = "<|endoftext|>"
    stopping_criteria = KeywordsStoppingCriteria(stop_str, tokenizer, input_ids)
    
    output_ids = model.generate(
        input_ids,
        images=image_tensor,
        do_sample=False,
        temperature=0,
        max_new_tokens=512,
        use_cache=True,
        stopping_criteria=[stopping_criteria]
    )
    
    outputs = tokenizer.decode(output_ids[0, input_ids.shape[1]:]).replace("<|endoftext|>", "").strip()
    print(outputs)
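
The prompt construction in the script above splits the text at the `<image>` placeholder, tokenizes each side, and joins the two halves with the id -200. The following is a minimal sketch of that step as a standalone function, not the project's own helper. The names `build_input_ids` and `toy_tokenize` are hypothetical and introduced only for illustration; the toy tokenizer merely stands in for `tokenizer(chunk).input_ids`, so the snippet runs without downloading the model.

```python
from typing import Callable, List

IMAGE_TOKEN_INDEX = -200  # id spliced in at the image position in the quick-start script


def build_input_ids(text: str, tokenize: Callable[[str], List[int]],
                    placeholder: str = "<image>") -> List[int]:
    """Tokenize the text around a single image placeholder and splice in the image id."""
    chunks = text.split(placeholder)
    if len(chunks) != 2:
        raise ValueError(f"expected exactly one {placeholder!r} in the prompt")
    before, after = chunks
    return tokenize(before) + [IMAGE_TOKEN_INDEX] + tokenize(after)


def toy_tokenize(chunk: str) -> List[int]:
    """Hypothetical stand-in for a real tokenizer: one id (the word length) per word."""
    return [len(word) for word in chunk.split()]


prompt = "Categorize the object."
text = f"USER: <image>\n{prompt} ASSISTANT:"
ids = build_input_ids(text, toy_tokenize)
print(ids)  # [5, -200, 10, 3, 7, 10]
```

With the real tokenizer, one would presumably pass `lambda s: tokenizer(s).input_ids` as `tokenize` and wrap the result in `torch.tensor(...).unsqueeze(0)`, as the script does.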
    

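Before running the quick-start script above, it may help to confirm that the minimum dependencies listed earlier are importable in the active environment. The sketch below uses only the standard library; `missing_modules` is a hypothetical helper name, and the list of import names is an assumption derived from the `pip install` line (the `pillow` package is imported as `PIL`).

```python
import importlib.util
from typing import Iterable, List


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names for the packages installed in the quick start (pillow is imported as PIL).
required = ["torch", "transformers", "accelerate", "PIL"]
missing = missing_modules(required)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All quick-start dependencies are importable.")
```

This only checks that the modules can be located; it does not verify versions or GPU support.
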
License

This project uses certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses. The content of this project itself is licensed under the Apache License 2.0.

Model size: 1.89B params (Safetensors; tensor types: BF16, FP16)