🌟 ViT-L/16 or ViT-B/16 LoRA for ImageNet-1K Fine-tuning (Top-1 ≈ 71%)

This repository contains LoRA fine-tuned weights for Vision Transformer (ViT) models trained on the ImageNet-1K classification dataset.

The repository includes:

  • LoRA adapter weights (query/key/value/output + MLP LoRA, r=32)
  • Classification head weights (classifier_head.pth)
  • A detailed model card (this file)

This enables loading the pretrained ViT together with LoRA and the final classifier head to reproduce the reported ImageNet-1K accuracy.


πŸ“Œ Model Summary

| Item | Description |
|---|---|
| Base model | google/vit-base-patch16-224 or ViT-Large/16 (depending on the release) |
| Task | ImageNet-1K classification |
| Dataset | ILSVRC 2012 ImageNet-1K |
| Training method | PEFT LoRA (r=32), applied to Q/K/V/O + MLP |
| Trainable params | ~2% of total parameters |
| Final accuracy | Top-1 ≈ 71% |
| Resolution | 224×224 |
| Optimizer | AdamW |
| Mixed precision | bf16 / fp16 |
| Data augmentations | RandAugment, Mixup, CutMix, RandomResizedCrop |

πŸš€ How to Use

1️⃣ Load the base ViT model

from transformers import ViTForImageClassification
from peft import PeftModel
import torch

# Use the checkpoint matching your release (ViT-Base shown here)
base = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224"
)

2️⃣ Load LoRA adapter

model = PeftModel.from_pretrained(
    base,
    "username/repo-name"
)

3️⃣ Load classification head

# Restore the fine-tuned classifier head. The attribute path below assumes
# a PeftModel wrapper around ViTForImageClassification; it may differ
# slightly across peft versions.
state = torch.load("classifier_head.pth", map_location="cpu")
model.base_model.model.classifier.load_state_dict(state)
model.eval()

🎯 Intended Use

This model is designed for:

  • ImageNet-1K classification
  • Downstream dataset finetuning via LoRA
  • Feature extraction / embedding extraction
  • Transfer learning to custom datasets
  • Model compression and deployment

It is not intended for:

  • Safety-critical systems
  • Medical or legal decision making
  • Bias-sensitive applications

🧩 Training Details

LoRA Configuration

{
  "r": 32,
  "lora_alpha": 32,
  "lora_dropout": 0.05,
  "target_modules": [
    "query", "key", "value",
    "intermediate.dense", "output.dense"
  ]
}

LoRA is applied to:

  • Multi-head self-attention Q/K/V/O
  • MLP up- and down-projections (1024 → 4096 → 1024 for ViT-Large)

Optimizer & Schedule

  • Optimizer: AdamW
  • Learning rate: 1e-4
  • Weight decay: 0.01
  • LR scheduler: cosine annealing
  • Warmup: None
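
The recipe above can be sketched in PyTorch as follows (`build_optimizer` and `total_steps` are hypothetical names; only LoRA and head parameters are trainable, so the optimizer sees ~2% of the weights):

```python
import torch

def build_optimizer(model: torch.nn.Module, total_steps: int):
    # AdamW over only the trainable (LoRA + classifier head) parameters
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.AdamW(params, lr=1e-4, weight_decay=0.01)
    # Cosine annealing over the full run, no warmup (matching the list above)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_steps)
    return optimizer, scheduler
```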

πŸ“š Dataset

  • ImageNet-1K (ILSVRC2012)
  • 1.28M train images
  • 50k validation images
  • 1000 classes

πŸ₯‡ Evaluation

| Metric | Value |
|---|---|
| Top-1 Accuracy | ~71% |
| Top-5 Accuracy | not reported |

Evaluation was done on the official 50k-val set.
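
A minimal top-1 accuracy loop over the validation loader might look like this (a sketch; `top1_accuracy` is a hypothetical helper, and `forward` is any callable mapping a batch of images to logits):

```python
import torch

@torch.no_grad()
def top1_accuracy(forward, loader) -> float:
    """forward: images -> logits of shape (batch, num_classes)."""
    correct = total = 0
    for images, labels in loader:
        preds = forward(images).argmax(dim=-1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total
```

For the HF model above, pass `lambda x: model(pixel_values=x).logits` as `forward`.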


πŸ“¦ Files Included

adapter_model.safetensors   # LoRA weights
adapter_config.json         # LoRA configuration
classifier_head.pth         # Final classification head
README.md                   # This model card

πŸ“ Citation

If you use this model, please cite:

@article{hu2021lora,
  title={LoRA: Low-Rank Adaptation of Large Language Models},
  author={Hu, Edward J. and others},
  journal={arXiv preprint arXiv:2106.09685},
  year={2021}
}

@article{dosovitskiy2020vit,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and others},
  journal={arXiv preprint arXiv:2010.11929},
  year={2020}
}

⚠️ Limitations

  • Trained only at 224×224 resolution
  • May be biased toward ImageNet categories
  • LoRA updates only ~2% of parameters (some categories could underfit)

⚖️ License

This model inherits the license of the base ViT model and LoRA implementation. Check the respective repos for details.


πŸ™Œ Acknowledgements

Thanks to:

  • Google Research (ViT)
  • Hugging Face Transformers
  • PEFT Team (LoRA)
  • ImageNet dataset maintainers