BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Paper: arXiv:2201.12086
This model is a fine-tuned version of Salesforce/blip-image-captioning-large, adapted for English image captioning on the Flickr30K dataset. Given an input image, it generates an English caption describing the image content.
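For a quick test, the model should also work through the `image-to-text` pipeline; the snippet below is a minimal sketch, assuming the merged weights load as a standard BLIP captioning checkpoint.

```python
# Quick-start sketch via the image-to-text pipeline
# (assumption: the merged LoRA weights load like a standard BLIP checkpoint)
from transformers import pipeline

captioner = pipeline("image-to-text", model="omarsabri8756/blip-merged-lora-flickr-30k")
result = captioner("path/to/your/image.jpg")  # accepts a local path, URL, or PIL image
print(result[0]["generated_text"])
```

The full example below loads the processor and model directly and exposes the generation parameters.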
```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import torch
import matplotlib.pyplot as plt

# Load model and processor
processor = BlipProcessor.from_pretrained("omarsabri8756/blip-merged-lora-flickr-30k")
model = BlipForConditionalGeneration.from_pretrained("omarsabri8756/blip-merged-lora-flickr-30k")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Load an image from a local path
image_path = "path/to/your/image.jpg"
image = Image.open(image_path).convert("RGB")

# Show the image
plt.imshow(image)
plt.axis('off')
plt.title("Input Image")
plt.show()

# Generate an English caption
model.eval()
with torch.no_grad():
    pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device)
    generated_output = model.generate(
        pixel_values=pixel_values,
        max_length=75,
        min_length=5,
        num_beams=5,
        repetition_penalty=1.5,
        length_penalty=1.0,
        no_repeat_ngram_size=3,
        early_stopping=True,
    )

caption = processor.batch_decode(generated_output, skip_special_tokens=True)[0]
print(caption)  # Prints the English caption
```
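Several images can also be captioned in one batch by passing a list of images to the processor; the following is a minimal sketch, assuming the `model`, `processor`, and `device` objects from the example above (the image paths are placeholders).

```python
# Batch-captioning sketch (assumes model, processor, and device from the example above)
image_paths = ["first.jpg", "second.jpg"]  # hypothetical local paths
batch = [Image.open(p).convert("RGB") for p in image_paths]

with torch.no_grad():
    inputs = processor(images=batch, return_tensors="pt").to(device)
    output_ids = model.generate(**inputs, max_length=75, num_beams=5)

captions = processor.batch_decode(output_ids, skip_special_tokens=True)
for path, caption in zip(image_paths, captions):
    print(f"{path}: {caption}")
```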
| Metric | Score |
|---|---|
| BLEU-1 | 75 |
| BLEU-2 | 55 |
| BLEU-3 | 41 |
| BLEU-4 | 30 |
| ROUGE-1 | 57 |
| ROUGE-2 | 34 |
| METEOR | 54 |
The model was evaluated on the Flickr30k test split, which contains 1,000 images with 5 reference captions each.
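The exact evaluation script is not part of this card; the snippet below is a rough sketch of how such scores could be computed with the Hugging Face `evaluate` library, assuming lists of generated captions and per-image reference captions.

```python
# Rough evaluation sketch (assumed workflow, not the original evaluation code)
import evaluate

predictions = ["a dog runs across a grassy field"]  # generated captions
references = [[                                     # per-image references (Flickr30k has 5; two shown)
    "a dog running through the grass",
    "a brown dog runs in a field",
]]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")

print(bleu.compute(predictions=predictions, references=references))
print(rouge.compute(predictions=predictions, references=references))
print(meteor.compute(predictions=predictions, references=references))
```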
Base model: Salesforce/blip-image-captioning-large