# BLIP Image Captioning - English (Flickr8k)
This model is a fine-tuned version of [Salesforce/blip-image-captioning-large](https://huggingface.co/Salesforce/blip-image-captioning-large), adapted for English image captioning on the Flickr8k dataset. Given an input image, it generates a relevant English caption describing the image's content.
## Model Sources
- Paper: Based on ["BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation"](https://arxiv.org/abs/2201.12086)
## How to Get Started with the Model
```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import torch
import matplotlib.pyplot as plt

# Load model and processor
processor = BlipProcessor.from_pretrained("omarsabri8756/blip-merged-lora-flickr-8k")
model = BlipForConditionalGeneration.from_pretrained("omarsabri8756/blip-merged-lora-flickr-8k")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Load an image from a local path
image_path = "path/to/your/image.jpg"
image = Image.open(image_path).convert("RGB")

# Show the image
plt.imshow(image)
plt.axis('off')
plt.title("Input Image")
plt.show()

# Generate an English caption with beam search
model.eval()
with torch.no_grad():
    pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device)
    generated_output = model.generate(
        pixel_values=pixel_values,
        max_length=75,
        min_length=5,
        num_beams=5,
        repetition_penalty=1.5,
        length_penalty=1.0,
        no_repeat_ngram_size=3,
        early_stopping=True,
    )

caption = processor.batch_decode(generated_output, skip_special_tokens=True)[0]
print(caption)  # Prints the English caption
```
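The processor also accepts a list of images, so several captions can be generated in one forward pass. Below is a minimal batching sketch under the same setup as above; the file paths are placeholders for your own images.

```python
# Minimal batched-captioning sketch: reuses `processor`, `model`, and
# `device` from the snippet above. The paths are hypothetical.
paths = ["img1.jpg", "img2.jpg"]
images = [Image.open(p).convert("RGB") for p in paths]

model.eval()
with torch.no_grad():
    pixel_values = processor(images=images, return_tensors="pt").pixel_values.to(device)
    output_ids = model.generate(
        pixel_values=pixel_values,
        max_length=75,
        num_beams=5,
    )

# batch_decode returns one caption per input image, in order
captions = processor.batch_decode(output_ids, skip_special_tokens=True)
for path, caption in zip(paths, captions):
    print(f"{path}: {caption}")
```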
## Training Details
### Dataset
- Name: Flickr8k
- Description: 8,000 images, each paired with 5 English captions.
- Preprocessing: Images resized to 384×384; captions lowercased and tokenized (see the sketch below).
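For reference, both preprocessing steps are typically handled through the BlipProcessor, whose image processor resizes and normalizes the pixels while its tokenizer encodes the caption. This is a minimal sketch, not the exact training pipeline; the explicit `.lower()` call and the padding settings are assumptions based on the description above.

```python
from transformers import BlipProcessor
from PIL import Image

processor = BlipProcessor.from_pretrained("omarsabri8756/blip-merged-lora-flickr-8k")

image = Image.open("example.jpg").convert("RGB")  # hypothetical file
caption = "A dog runs across the grass."          # hypothetical caption

# The image processor resizes to the model's expected 384x384 resolution
# and normalizes; lowercasing is applied manually here, as an assumption
# about the preprocessing described above.
inputs = processor(
    images=image,
    text=caption.lower(),
    padding="max_length",
    truncation=True,
    return_tensors="pt",
)

print(inputs.pixel_values.shape)  # torch.Size([1, 3, 384, 384])
print(inputs.input_ids.shape)     # tokenized caption ids
```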
### Hyperparameters
- Optimizer: AdamW
- Learning Rate: 5e-5
- Batch Size: 16
- Precision: FP16 mixed precision
- Epochs: 5
- LR Scheduler: Cosine with warmup
- Weight Decay: 0.01
- LoRA Rank: 32
- LoRA Alpha: 64
- LoRA Dropout: 0.01 (see the PEFT sketch after this list)
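For orientation, the settings above would map onto a PEFT/Trainer setup roughly as follows. This is a minimal sketch, not the actual training script: `target_modules` and `warmup_ratio` are assumptions, since the card does not specify them.

```python
from peft import LoraConfig, get_peft_model
from transformers import BlipForConditionalGeneration, TrainingArguments

base = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-large"
)

# LoRA settings from the list above; target_modules is an assumption
# (attention projections are a common choice).
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.01,
    target_modules=["query", "value"],
    bias="none",
)
model = get_peft_model(base, lora_config)
model.print_trainable_parameters()

# Optimizer/schedule settings from the list above; warmup_ratio is an
# assumed value, as the exact warmup length is not stated.
args = TrainingArguments(
    output_dir="blip-lora-flickr8k",
    learning_rate=5e-5,
    per_device_train_batch_size=16,
    num_train_epochs=5,
    fp16=True,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
)

# After training, merging the adapters into the base weights yields a
# standalone checkpoint, consistent with this repo's "merged-lora" name:
# merged_model = model.merge_and_unload()
```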
## Evaluation Results
| Metric  | Score |
|---------|-------|
| BLEU-1  | 77.30 |
| BLEU-2  | 59.17 |
| BLEU-3  | 44.93 |
| BLEU-4  | 33.30 |
| ROUGE-1 | 60.08 |
| ROUGE-2 | 37.10 |
| METEOR  | 59.63 |
## Evaluation
### Testing Data
The model was evaluated on the Flickr8k test split, which contains 1,000 images with 5 reference captions each.
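The reported scores can be reproduced with Hugging Face's `evaluate` package. A minimal sketch, assuming `predictions` holds one generated caption per test image and `references` holds the five gold captions for each; the example strings are placeholders.

```python
import evaluate

# Hypothetical inputs: one generated caption per image, plus the five
# reference captions for that image.
predictions = ["a dog runs through the grass"]
references = [[
    "a dog is running across a grassy field",
    "a brown dog runs through the grass",
    "the dog is running outside",
    "a dog sprints over green grass",
    "a dog running in a field",
]]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")

# BLEU-1 through BLEU-4 via max_order; ROUGE and METEOR accept
# multi-reference inputs directly.
for n in range(1, 5):
    score = bleu.compute(predictions=predictions, references=references, max_order=n)
    print(f"BLEU-{n}: {score['bleu']:.4f}")
print(rouge.compute(predictions=predictions, references=references))
print(meteor.compute(predictions=predictions, references=references))
```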
### Results
The model performs well on everyday scenes and common activities, generating grammatically correct and contextually appropriate English captions.
Performance may be slightly lower for highly specific or rare visual concepts.
## Bias, Risks, and Limitations
- The model was trained on Flickr8k, which may not represent the full diversity of visual scenes worldwide.
- May produce culturally biased or stereotypical descriptions.
- May struggle with complex, ambiguous, or unusual scenes.