BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation
Paper: arXiv:2201.12086
This model is a fine-tuned version of Salesforce/blip-image-captioning-large, adapted for English image captioning on the Flickr30K dataset. Given an input image, it generates an English caption describing the image content.
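For a quick test, the model should also work through the `image-to-text` pipeline; the snippet below is a minimal sketch, assuming the merged weights load as a standard BLIP captioning checkpoint.

```python
# Quick-start sketch via the image-to-text pipeline
# (assumption: the merged LoRA weights load like a standard BLIP checkpoint)
from transformers import pipeline

captioner = pipeline("image-to-text", model="omarsabri8756/blip-merged-lora-flickr-30k")
result = captioner("path/to/your/image.jpg")  # accepts a local path, URL, or PIL image
print(result[0]["generated_text"])
```

The full example below loads the processor and model directly and exposes the generation parameters.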
```python
from transformers import BlipProcessor, BlipForConditionalGeneration
from PIL import Image
import torch
import matplotlib.pyplot as plt

# Load model and processor
processor = BlipProcessor.from_pretrained("omarsabri8756/blip-merged-lora-flickr-30k")
model = BlipForConditionalGeneration.from_pretrained("omarsabri8756/blip-merged-lora-flickr-30k")

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# Load an image from a local path
image_path = "path/to/your/image.jpg"
image = Image.open(image_path).convert("RGB")

# Show the image
plt.imshow(image)
plt.axis('off')
plt.title("Input Image")
plt.show()

# Generate an English caption
model.eval()
with torch.no_grad():
    pixel_values = processor(images=image, return_tensors="pt").pixel_values.to(device)
    generated_output = model.generate(
        pixel_values=pixel_values,
        max_length=75,
        min_length=5,
        num_beams=5,
        repetition_penalty=1.5,
        length_penalty=1.0,
        no_repeat_ngram_size=3,
        early_stopping=True,
    )

caption = processor.batch_decode(generated_output, skip_special_tokens=True)[0]
print(caption)  # Prints the English caption
```
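Several images can also be captioned in one batch by passing a list of images to the processor; the following is a minimal sketch, assuming the `model`, `processor`, and `device` objects from the example above (the image paths are placeholders).

```python
# Batch-captioning sketch (assumes model, processor, and device from the example above)
image_paths = ["first.jpg", "second.jpg"]  # hypothetical local paths
batch = [Image.open(p).convert("RGB") for p in image_paths]

with torch.no_grad():
    inputs = processor(images=batch, return_tensors="pt").to(device)
    output_ids = model.generate(**inputs, max_length=75, num_beams=5)

captions = processor.batch_decode(output_ids, skip_special_tokens=True)
for path, caption in zip(image_paths, captions):
    print(f"{path}: {caption}")
```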
| Metric | Score |
|---|---|
| BLEU-1 | 75 |
| BLEU-2 | 55 |
| BLEU-3 | 41 |
| BLEU-4 | 30 |
| ROUGE-1 | 57 |
| ROUGE-2 | 34 |
| METEOR | 54 |
The model was evaluated on the Flickr30k test split, which contains 1,000 images with 5 reference captions each.
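The exact evaluation script is not part of this card; the snippet below is a rough sketch of how such scores could be computed with the Hugging Face `evaluate` library, assuming lists of generated captions and per-image reference captions.

```python
# Rough evaluation sketch (assumed workflow, not the original evaluation code)
import evaluate

predictions = ["a dog runs across a grassy field"]  # generated captions
references = [[                                     # per-image references (Flickr30k has 5; two shown)
    "a dog running through the grass",
    "a brown dog runs in a field",
]]

bleu = evaluate.load("bleu")
rouge = evaluate.load("rouge")
meteor = evaluate.load("meteor")

print(bleu.compute(predictions=predictions, references=references))
print(rouge.compute(predictions=predictions, references=references))
print(meteor.compute(predictions=predictions, references=references))
```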
Base model: Salesforce/blip-image-captioning-large