# ClipCap fine-tuned for Narrative Image Captioning

ClipCap base model fine-tuned on the HL Narratives dataset for generating high-level narrative descriptions.

## Model fine-tuning 🏋️

We fine-tune the language model and the mapping network, starting from the model pretrained on COCO; a minimal sketch of this setup follows the list below.

- Trained for 3 epochs
- Learning rate: 5e-5
- Adam optimizer
- Half-precision (fp16)
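
As a reference, here is a minimal PyTorch sketch of that configuration. The dataloader, the batch layout (tokens, mask, prefix), and the loss follow the original ClipCap training code, but this is an illustrative assumption, not the exact script used for this checkpoint.

```python
import torch
from torch.cuda.amp import autocast, GradScaler

# Hypothetical setup: `model` is a ClipCaptionModel, `train_dataloader` yields
# ClipCap-style (tokens, mask, prefix) batches. Not the authors' exact script.
optimizer = torch.optim.Adam(model.parameters(), lr=5e-5)
scaler = GradScaler()  # gradient scaling for fp16 training

for epoch in range(3):
    for tokens, mask, prefix in train_dataloader:
        tokens, mask = tokens.to(device), mask.to(device)
        prefix = prefix.to(device, dtype=torch.float32)
        optimizer.zero_grad()
        with autocast():  # run the forward pass in half precision
            outputs = model(tokens, prefix, mask)
            # score only the caption tokens that follow the prefix
            logits = outputs.logits[:, model.prefix_length - 1 : -1]
            loss = torch.nn.functional.cross_entropy(
                logits.reshape(-1, logits.shape[-1]), tokens.flatten(), ignore_index=0
            )
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()
```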

## Test set metrics 🧾

| CIDEr | SacreBLEU | ROUGE-L |
|-------|-----------|---------|
| 63.91 | 8.15      | 24.53   |
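
These scores can be reproduced with standard captioning-metric implementations. Below is a hedged sketch using the `evaluate` library for SacreBLEU and ROUGE-L, and `pycocoevalcap` for CIDEr; the package choice and toy data are assumptions, not necessarily the authors' evaluation pipeline:

```python
import evaluate
from pycocoevalcap.cider.cider import Cider

# Toy data for illustration only
predictions = ["he is riding a skateboard in a skate park, he wants to skate"]
references = [["he is skateboarding at the park because he wants to skate"]]

sacrebleu = evaluate.load("sacrebleu")
rouge = evaluate.load("rouge")
print(sacrebleu.compute(predictions=predictions, references=references)["score"])
print(rouge.compute(predictions=predictions, references=references)["rougeL"])

# CIDEr expects dicts mapping example ids to lists of captions
gts = {i: refs for i, refs in enumerate(references)}
res = {i: [pred] for i, pred in enumerate(predictions)}
score, _ = Cider().compute_score(gts, res)
print(score)
```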

## Demo

Open In Colab

## Installation

```bash
pip install git+https://github.com/michelecafagna26/CLIPCap.git
```

## Download the model

```bash
git lfs install  # if not installed
git clone https://huggingface.co/michelecafagna26/clipcap-base-captioning-ft-hl-narratives
```
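
Alternatively, the checkpoint can be fetched programmatically with `huggingface_hub`; this assumes the weights file in the repo is named `pytorch_model.pt`, as in the snippet below:

```python
from huggingface_hub import hf_hub_download

# Downloads the checkpoint into the local HF cache and returns its path
model_path = hf_hub_download(
    repo_id="michelecafagna26/clipcap-base-captioning-ft-hl-narratives",
    filename="pytorch_model.pt",
)
```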

## Model in Action 🚀

```python
from clipcap import ClipCaptionModel
from transformers import GPT2Tokenizer
import torch
import clip
import requests
from PIL import Image

model_path = "clipcap-base-captioning-ft-hl-narratives/pytorch_model.pt"  # change accordingly

# load CLIP
device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/32", device=device, jit=False)
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
prefix_length = 10

# load ClipCap
model = ClipCaptionModel(prefix_length, tokenizer=tokenizer)
model.from_pretrained(model_path)
model = model.eval()
model = model.to(device)

# load the image
img_url = 'https://datasets-server.huggingface.co/assets/michelecafagna26/hl-narratives/--/default/train/3/image/image.jpg'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

# extract the CLIP prefix
image = preprocess(raw_image).unsqueeze(0).to(device)
with torch.no_grad():
    prefix = clip_model.encode_image(image).to(device, dtype=torch.float32)
    prefix_embed = model.clip_project(prefix).reshape(1, prefix_length, -1)

# generate the caption with beam search
caption = model.generate_beam(embed=prefix_embed)[0]
print(caption)
# >> "He is riding a skateboard in a skate park, he wants to skate."
```
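
The beam width can typically be adjusted. This assumes the packaged `generate_beam` accepts a `beam_size` argument, as the original ClipCap implementation does (check the repo's signature):

```python
# Assumption: beam_size is supported as in the original ClipCap code
captions = model.generate_beam(embed=prefix_embed, beam_size=5)
print(captions[0])  # highest-scoring beam
```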

## BibTeX and citation info

```bibtex
@inproceedings{cafagna2023hl,
  title     = {{HL} {D}ataset: {V}isually-grounded {D}escription of {S}cenes, {A}ctions and {R}ationales},
  author    = {Cafagna, Michele and van Deemter, Kees and Gatt, Albert},
  booktitle = {Proceedings of the 16th International Natural Language Generation Conference (INLG'23)},
  address   = {Prague, Czech Republic},
  year      = {2023}
}
```