# ViT-L/16 or ViT-B/16 LoRA Fine-tune on ImageNet-1K (Top-1 ≈ 71%)
This repository contains LoRA fine-tuned weights for Vision Transformer (ViT) models trained on the ImageNet-1K classification dataset.
The repository includes:
- LoRA adapter weights (query/key/value/output + MLP LoRA, r=32)
- Classification head weights (`classifier_head.pth`)
- A detailed model card (this file)
This enables loading the pretrained ViT together with LoRA and the final classifier head to reproduce the reported ImageNet-1K accuracy.
## Model Summary
| Item | Description |
|---|---|
| Base model | google/vit-base-patch16-224 or ViT-L/16 (depending on the release) |
| Task | ImageNet-1K classification |
| Dataset | ILSVRC 2012 ImageNet-1K |
| Training method | PEFT LoRA (r=32), applied to Q/K/V/O + MLP |
| Trainable params | ~2% of total parameters |
| Final accuracy | Top-1 ≈ 71% |
| Resolution | 224×224 |
| Optimizer | AdamW |
| Mixed precision | bf16 / fp16 |
| Data augmentations | RandAugment, Mixup, CutMix, RandomResizedCrop |
## How to Use

### 1. Load the base ViT model

```python
from transformers import ViTForImageClassification
from peft import PeftModel
import torch

base = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224"
)
```
### 2. Load the LoRA adapter

```python
model = PeftModel.from_pretrained(
    base,
    "username/repo-name"
)
```
### 3. Load the classification head

```python
state = torch.load("classifier_head.pth", map_location="cpu")
model.base_model.model.classifier.load_state_dict(state)
model.eval()
```
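Putting the three steps together, inference looks like the sketch below. `predict_top1` is a hypothetical helper (not part of this repo); the commented usage assumes the base checkpoint's `ViTImageProcessor` for preprocessing.

```python
import torch

def predict_top1(model, pixel_values, id2label):
    # Forward pass without gradients; map the highest logit to a class name.
    with torch.no_grad():
        logits = model(pixel_values=pixel_values).logits
    return id2label[logits.argmax(dim=-1).item()]

# Typical usage with the model assembled in steps 1-3:
#   from transformers import ViTImageProcessor
#   processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
#   inputs = processor(images=pil_image, return_tensors="pt")
#   label = predict_top1(model, inputs["pixel_values"], model.config.id2label)
```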
## Intended Use
This model is designed for:
- ImageNet-1K classification
- Downstream dataset finetuning via LoRA
- Feature extraction / embedding extraction
- Transfer learning to custom datasets
- Model compression and deployment
It is not intended for:
- Safety-critical systems
- Medical or legal decision making
- Bias-sensitive applications
## Training Details

### LoRA Configuration

```json
{
  "r": 32,
  "lora_alpha": 32,
  "lora_dropout": 0.05,
  "target_modules": [
    "query", "key", "value",
    "output.dense", "intermediate.dense"
  ]
}
```
LoRA is applied to:
- Multi-head self-attention Q/K/V/O
- MLP hidden layer (4096 ↔ 1024 projections)
### Optimizer & Schedule
- Optimizer: AdamW
- Learning rate: 1e-4
- Weight decay: 0.01
- LR scheduler: cosine annealing
- Warmup: None
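A minimal PyTorch sketch of this recipe; the 10-epoch horizon and the placeholder parameters are assumptions (the card does not state the training length), and in practice the optimizer would receive the LoRA parameters.

```python
import torch

# Stand-in parameter; in the real run this would be the trainable LoRA weights.
params = [torch.nn.Parameter(torch.zeros(10))]
optimizer = torch.optim.AdamW(params, lr=1e-4, weight_decay=0.01)
# Cosine annealing over an assumed 10-epoch schedule, no warmup.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)

for epoch in range(10):
    optimizer.step()   # placeholder for a full epoch of training updates
    scheduler.step()   # decay the learning rate once per epoch
```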
## Dataset
- ImageNet-1K (ILSVRC2012)
- 1.28M train images
- 50k validation images
- 1000 classes
## Evaluation

| Metric | Value |
|---|---|
| Top-1 Accuracy | ~71% |
| Top-5 Accuracy | not reported |
Evaluation was performed on the official 50k-image validation split.
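Top-1/top-5 accuracy over a batch of logits can be computed with a small helper like the one below (a generic sketch, not the exact evaluation script used for this model):

```python
import torch

def topk_accuracy(logits, targets, ks=(1, 5)):
    """Fraction of samples whose true label is within the top-k logits."""
    _, pred = logits.topk(max(ks), dim=1)       # (N, max_k) class indices
    correct = pred.eq(targets.unsqueeze(1))     # (N, max_k) boolean hits
    return {k: correct[:, :k].any(dim=1).float().mean().item() for k in ks}
```

In an evaluation loop this would be accumulated over all 50k validation images rather than a single batch.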
## Files Included

```
adapter_model.safetensors   # LoRA weights
adapter_config.json         # LoRA configuration
classifier_head.pth         # final classification head
README.md                   # this model card
```
## Citation

If you use this model, please cite:

```bibtex
@article{hu2021lora,
  title={LoRA: Low-Rank Adaptation of Large Language Models},
  author={Hu, Edward J. and others},
  year={2021}
}

@article{dosovitskiy2020vit,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and others},
  year={2020}
}
```
## Limitations

- Trained only at 224×224 resolution
- May be biased toward ImageNet categories
- LoRA updates only ~2% of parameters (some categories could underfit)
## License
This model inherits the license of the base ViT model and LoRA implementation. Check the respective repos for details.
## Acknowledgements
Thanks to:
- Google Research (ViT)
- Hugging Face Transformers
- PEFT Team (LoRA)
- ImageNet dataset maintainers