# BLIP Image Captioning - English (Flickr8k)
This model is a fine-tuned version of [Salesforce/blip-image-captioning-large](https://huggingface.co/Salesforce/blip-image-captioning-large), adapted for English image captioning on the Flickr8k dataset. Given an input image, it generates a relevant English caption describing the image's content.
## Model Sources
- Paper: Based on ["BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation"](https://arxiv.org/abs/2201.12086)
## How to Get Started with the Model
```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import torch
import matplotlib.pyplot as plt

# Load model and processor
processor = BlipProcessor.from_pretrained("omarsabri8756/blip-merged-lora-flickr-8k")
model = BlipForConditionalGeneration.from_pretrained("omarsabri8756/blip-merged-lora-flickr-8k")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Load an image from a local path
image_path = "path/to/your/image.jpg"
image = Image.open(image_path).convert("RGB")

# Show the image
plt.imshow(image)
plt.axis('off')
plt.title("Input Image")
plt.show()

# Generate an English caption with beam search
model.eval()
with torch.no_grad():
    pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device)
    generated_output = model.generate(
        pixel_values=pixel_values,
        max_length=75,
        min_length=5,
        num_beams=5,
        repetition_penalty=1.5,
        length_penalty=1.0,
        no_repeat_ngram_size=3,
        early_stopping=True,
    )

caption = processor.batch_decode(generated_output, skip_special_tokens=True)[0]
print(caption)  # Prints the English caption
```
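The processor also accepts a list of images, so several captions can be generated in one forward pass. Below is a minimal batching sketch under the same setup as above; the file paths are placeholders for your own images.

```python
# Minimal batched-captioning sketch: reuses `processor`, `model`, and
# `device` from the snippet above. The paths are hypothetical.
paths = ["img1.jpg", "img2.jpg"]
images = [Image.open(p).convert("RGB") for p in paths]

model.eval()
with torch.no_grad():
    pixel_values = processor(images=images, return_tensors="pt").pixel_values.to(device)
    output_ids = model.generate(
        pixel_values=pixel_values,
        max_length=75,
        num_beams=5,
    )

# batch_decode returns one caption per input image, in order
captions = processor.batch_decode(output_ids, skip_special_tokens=True)
for path, caption in zip(paths, captions):
    print(f"{path}: {caption}")
```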
## Training Details
### Dataset
- Name: Flickr8k
- Description: 8,000 images, each paired with 5 English captions.
- Preprocessing: Images resized to 384×384; captions lowercased and tokenized (see the sketch below).
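For reference, both preprocessing steps are typically handled through the BlipProcessor, whose image processor resizes and normalizes the pixels while its tokenizer encodes the caption. This is a minimal sketch, not the exact training pipeline; the explicit `.lower()` call and the padding settings are assumptions based on the description above.

```python
from transformers import BlipProcessor
from PIL import Image

processor = BlipProcessor.from_pretrained("omarsabri8756/blip-merged-lora-flickr-8k")

image = Image.open("example.jpg").convert("RGB")  # hypothetical file
caption = "A dog runs across the grass."          # hypothetical caption

# The image processor resizes to the model's expected 384x384 resolution
# and normalizes; lowercasing is applied manually here, as an assumption
# about the preprocessing described above.
inputs = processor(
    images=image,
    text=caption.lower(),
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)

print(inputs.pixel_values.shape)  # torch.Size([1, 3, 384, 384])
print(inputs.input_ids.shape)     # tokenized caption ids
```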
### Hyperparameters
- Optimizer: AdamW
- Learning Rate: 5e-5
- Batch Size: 16
- Precision: FP16 mixed precision
- Epochs: 5
- LR Scheduler: Cosine with warmup
- Weight Decay: 0.01
- LoRA Rank: 32
- LoRA Alpha: 64
- LoRA Dropout: 0.01 (see the PEFT sketch after this list)
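For orientation, the settings above would map onto a PEFT/Trainer setup roughly as follows. This is a minimal sketch, not the actual training script: `target_modules` and `warmup_ratio` are assumptions, since the card does not specify them.

```python
from peft import LoraConfig, get_peft_model
from transformers import BlipForConditionalGeneration, TrainingArguments

base = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-large"
)

# LoRA settings from the list above; target_modules is an assumption
# (attention projections are a common choice).
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.01,
    target_modules=["query", "value"],
    bias="none",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()

# Optimizer/schedule settings from the list above; warmup_ratio is an
# assumed value, as the exact warmup length is not stated.
args = TrainingArguments(
    output_dir="blip-lora-flickr8k",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    num_train_epochs=5,
    fp16=True,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
)

# After training, merging the adapters into the base weights yields a
# standalone checkpoint, consistent with this repo's "merged-lora" name:
# merged_model = model.merge_and_unload()
```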
## Evaluation Results
| Metric  | Score |
|---------|-------|
| BLEU-1  | 77.30 |
| BLEU-2  | 59.17 |
| BLEU-3  | 44.93 |
| BLEU-4  | 33.30 |
| ROUGE-1 | 60.08 |
| ROUGE-2 | 37.10 |
| METEOR  | 59.63 |
## Evaluation
### Testing Data
The model was evaluated on the Flickr8k test split, which contains 1,000 images with 5 reference captions each.
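The reported scores can be reproduced with Hugging Face's `evaluate` package. A minimal sketch, assuming `predictions` holds one generated caption per test image and `references` holds the five gold captions for each; the example strings are placeholders.

```python
import evaluate

# Hypothetical inputs: one generated caption per image, plus the five
# reference captions for that image.
predictions = ["a dog runs through the grass"]
references = [[
    "a dog is running across a grassy field",
    "a brown dog runs through the grass",
    "the dog is running outside",
    "a dog sprints over green grass",
    "a dog running in a field",
]]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")

# BLEU-1 through BLEU-4 via max_order; ROUGE and METEOR accept
# multi-reference inputs directly.
for n in range(1, 5):
    score = bleu.compute(predictions=predictions, references=references, max_order=n)
    print(f"BLEU-{n}: {score['bleu']:.4f}")
print(rouge.compute(predictions=predictions, references=references))
print(meteor.compute(predictions=predictions, references=references))
```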
### Results
The model performs well on everyday scenes and common activities, generating grammatically correct and contextually appropriate English captions.
Performance may be slightly lower for highly specific or rare visual concepts.
## Bias, Risks, and Limitations
- The model was trained on Flickr8k, which may not represent the full diversity of visual scenes worldwide.
- May produce culturally biased or stereotypical descriptions.
- May struggle with complex, ambiguous, or unusual scenes.