# ViT-L/16 or ViT-B/16 LoRA Fine-tune on ImageNet-1K (Top-1 ≈ 71%)
This repository contains LoRA fine-tuned weights for Vision Transformer (ViT) models trained on the ImageNet-1K classification dataset.
The repository includes:
- LoRA adapter weights (query/key/value/output + MLP LoRA, r=32)
- Classification head weights (`classifier_head.pth`)
- A detailed model card (this file)
This enables loading the pretrained ViT together with LoRA and the final classifier head to reproduce the reported ImageNet-1K accuracy.
## Model Summary
| Item | Description |
|---|---|
| Base model | google/vit-base-patch16-224 or ViT-L/16 (depending on the release) |
| Task | ImageNet-1K classification |
| Dataset | ILSVRC 2012 ImageNet-1K |
| Training method | PEFT LoRA (r=32), applied to Q/K/V/O + MLP |
| Trainable params | ~2% of total parameters |
| Final accuracy | Top-1 ≈ 71% |
| Resolution | 224×224 |
| Optimizer | AdamW |
| Mixed precision | bf16 / fp16 |
| Data augmentations | RandAugment, Mixup, CutMix, RandomResizedCrop |
## How to Use

### 1. Load the base ViT model

```python
from transformers import ViTForImageClassification
from peft import PeftModel
import torch

base = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224"
)
```
### 2. Load the LoRA adapter

```python
model = PeftModel.from_pretrained(
    base,
    "username/repo-name"
)
```
### 3. Load the classification head

```python
state = torch.load("classifier_head.pth", map_location="cpu")
model.base_model.model.classifier.load_state_dict(state)
model.eval()
```
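Putting the three steps together, inference looks like the sketch below. `predict_top1` is a hypothetical helper (not part of this repo); the commented usage assumes the base checkpoint's `ViTImageProcessor` for preprocessing.

```python
import torch

def predict_top1(model, pixel_values, id2label):
    # Forward pass without gradients; map the highest logit to a class name.
    with torch.no_grad():
        logits = model(pixel_values=pixel_values).logits
    return id2label[logits.argmax(dim=-1).item()]

# Typical usage with the model assembled in steps 1-3:
#   from transformers import ViTImageProcessor
#   processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
#   inputs = processor(images=pil_image, return_tensors="pt")
#   label = predict_top1(model, inputs["pixel_values"], model.config.id2label)
```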
## Intended Use
This model is designed for:
- ImageNet-1K classification
- Downstream dataset finetuning via LoRA
- Feature extraction / embedding extraction
- Transfer learning to custom datasets
- Model compression and deployment
It is not intended for:
- Safety-critical systems
- Medical or legal decision making
- Bias-sensitive applications
## Training Details

### LoRA Configuration

```json
{
  "r": 32,
  "lora_alpha": 32,
  "lora_dropout": 0.05,
  "target_modules": [
    "query", "key", "value",
    "output.dense", "intermediate.dense"
  ]
}
```
LoRA is applied to:
- Multi-head self-attention Q/K/V/O
- MLP hidden layer (4096 ↔ 1024 projections)
### Optimizer & Schedule
- Optimizer: AdamW
- Learning rate: 1e-4
- Weight decay: 0.01
- LR scheduler: cosine annealing
- Warmup: None
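A minimal PyTorch sketch of this recipe; the 10-epoch horizon and the placeholder parameters are assumptions (the card does not state the training length), and in practice the optimizer would receive the LoRA parameters.

```python
import torch

# Stand-in parameter; in the real run this would be the trainable LoRA weights.
params = [torch.nn.Parameter(torch.zeros(10))]
optimizer = torch.optim.AdamW(params, lr=1e-4, weight_decay=0.01)
# Cosine annealing over an assumed 10-epoch schedule, no warmup.
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=10)

for epoch in range(10):
    optimizer.step()   # placeholder for a full epoch of training updates
    scheduler.step()   # decay the learning rate once per epoch
```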
## Dataset
- ImageNet-1K (ILSVRC2012)
- 1.28M train images
- 50k validation images
- 1000 classes
## Evaluation

| Metric | Value |
|---|---|
| Top-1 Accuracy | ~71% |
| Top-5 Accuracy | not reported |
Evaluation was performed on the official 50k-image validation split.
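Top-1/top-5 accuracy over a batch of logits can be computed with a small helper like the one below (a generic sketch, not the exact evaluation script used for this model):

```python
import torch

def topk_accuracy(logits, targets, ks=(1, 5)):
    """Fraction of samples whose true label is within the top-k logits."""
    _, pred = logits.topk(max(ks), dim=1)       # (N, max_k) class indices
    correct = pred.eq(targets.unsqueeze(1))     # (N, max_k) boolean hits
    return {k: correct[:, :k].any(dim=1).float().mean().item() for k in ks}
```

In an evaluation loop this would be accumulated over all 50k validation images rather than a single batch.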
## Files Included

```
adapter_model.safetensors   # LoRA weights
adapter_config.json         # LoRA configuration
classifier_head.pth         # final classification head
README.md                   # this model card
```
## Citation

If you use this model, please cite:

```bibtex
@article{hu2021lora,
  title={LoRA: Low-Rank Adaptation of Large Language Models},
  author={Hu, Edward J. and others},
  year={2021}
}

@article{dosovitskiy2020vit,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and others},
  year={2020}
}
```
## Limitations

- Trained only at 224×224 resolution
- May be biased toward ImageNet categories
- LoRA updates only ~2% of parameters (some categories could underfit)
## License
This model inherits the license of the base ViT model and LoRA implementation. Check the respective repos for details.
## Acknowledgements
Thanks to:
- Google Research (ViT)
- Hugging Face Transformers
- PEFT Team (LoRA)
- ImageNet dataset maintainers