DINOv3 ViT-H/16+ Booru Tagger

A multi-label image tagger trained on e621 and Danbooru annotations, using a DINOv3 ViT-H/16+ backbone fine-tuned end-to-end with a single linear projection head.

Model Details

| Property | Value |
|---|---|
| Backbone | facebook/dinov3-vith16plus-pretrain-lvd1689m |
| Architecture | ViT-H/16+ · 32 layers · hidden dim 1280 · 20 heads · SwiGLU MLP · RoPE · 4 register tokens |
| Head | Linear((1 + 4) × 1280 → 74 625): CLS + 4 register tokens concatenated |
| Vocabulary | 74 625 tags (min frequency ≥ 50 across training set) |
| Input resolution | Any multiple of 16 px; trained at 512 px, generalises to higher resolutions |
| Input normalisation | ImageNet mean/std [0.485, 0.456, 0.406] / [0.229, 0.224, 0.225] |
| Output | Raw logits; apply sigmoid for per-tag probabilities |
| Parameters | ~632 M (backbone) + ~480 M (head) |
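
The head size and input constraints listed above can be cross-checked with a little arithmetic. The sketch below is illustrative only: the helper names are hypothetical, the head is assumed to carry a bias term, and rounding a side length down to a multiple of 16 is just one possible resizing policy, not necessarily what the inference script does.

```python
# Constants taken from the Model Details section.
PATCH = 16
HIDDEN_DIM = 1280
NUM_REGISTERS = 4
VOCAB_SIZE = 74_625
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)


def snap_to_patch(size, patch=PATCH):
    """Round a side length down to a multiple of the patch size (at least one patch)."""
    return max(patch, (size // patch) * patch)


def patch_tokens(height, width, patch=PATCH):
    """Number of patch tokens for an input whose sides are multiples of `patch`."""
    return (height // patch) * (width // patch)


def normalise_pixel(rgb):
    """Map an 8-bit RGB triple to ImageNet-normalised floats."""
    return tuple((c / 255.0 - m) / s
                 for c, m, s in zip(rgb, IMAGENET_MEAN, IMAGENET_STD))


# CLS token + 4 register tokens, each of width 1280, concatenated.
head_in = (1 + NUM_REGISTERS) * HIDDEN_DIM
# Weight matrix plus bias vector (the bias is an assumption).
head_params = head_in * VOCAB_SIZE + VOCAB_SIZE

print(head_in)                 # 6400
print(head_params)             # 477674625, i.e. roughly 480 M
print(patch_tokens(512, 512))  # 1024
print(snap_to_patch(1000))     # 992
```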

Training

| Hyperparameter | Value |
|---|---|
| Training data | e621 + Danbooru (parquet) |
| Batch size | 32 |
| Learning rate | 1e-6 |
| Warmup steps | 50 |
| Loss | BCEWithLogitsLoss with per-tag pos_weight = (neg/pos)^(1/T), cap 100 |
| Optimiser | AdamW (β₁=0.9, β₂=0.999, wd=0.01) |
| Precision | bfloat16 (backbone) / float32 (projection + loss) |
| Hardware | 2× GPU, ThreadPoolExecutor + NCCL all-reduce |
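
The loss entry gives the per-tag weight as (neg/pos)^(1/T), capped at 100. A minimal sketch of that formula follows. The function name is hypothetical, and T is treated as a free temperature-like parameter because its training value is not stated here. In PyTorch, a vector of such weights would be passed as the `pos_weight` argument of `torch.nn.BCEWithLogitsLoss`.

```python
def tag_pos_weight(pos_count, total, temperature, cap=100.0):
    """Positive-class weight for one tag: (neg / pos) ** (1 / T), clipped at `cap`.

    pos_count   -- number of training images carrying the tag
    total       -- number of training images overall
    temperature -- the T in the formula (value used in training is not stated)
    """
    if pos_count <= 0:
        # A tag with no positives would divide by zero; fall back to the cap.
        return cap
    neg_count = total - pos_count
    return min((neg_count / pos_count) ** (1.0 / temperature), cap)


# A tag on 1% of one million images: neg/pos = 99.
print(tag_pos_weight(10_000, 1_000_000, temperature=1.0))  # 99.0
# A larger T flattens the weight towards 1.
print(tag_pos_weight(10_000, 1_000_000, temperature=2.0))
# A tag at the 50-image minimum hits the cap.
print(tag_pos_weight(50, 1_000_000, temperature=2.0))      # 100.0
```

With T = 1 the weight is the raw negative-to-positive ratio; larger T compresses it, so rare tags are boosted less aggressively.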

Usage

Standalone (no transformers dependency)

from inference_tagger_standalone import Tagger

tagger = Tagger(
    checkpoint_path="tagger_proto.safetensors",
    vocab_path="tagger_vocab_with_categories.json",
    device="cuda",
)

tags = tagger.predict("photo.jpg", topk=40)
# → [("solo", 0.98), ("anthro", 0.95), ...]

# or threshold-based
tags = tagger.predict("https://example.com/image.jpg", threshold=0.35)
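
How `predict` turns raw logits into `(tag, probability)` pairs is not shown here, so the following is only a sketch of the usual post-processing: a sigmoid per tag, then either the top-k tags or every tag above a threshold. `select_tags` and the toy vocabulary are hypothetical and not part of `inference_tagger_standalone.py`.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def select_tags(logits, idx2tag, topk=None, threshold=None):
    """Return (tag, probability) pairs sorted by descending probability.

    If `threshold` is given, tags below it are dropped.
    If `topk` is given, at most that many tags are kept.
    """
    scored = [(tag, sigmoid(logit)) for tag, logit in zip(idx2tag, logits)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    if threshold is not None:
        scored = [pair for pair in scored if pair[1] >= threshold]
    if topk is not None:
        scored = scored[:topk]
    return scored


# Toy example with three tags; real output has one logit per vocabulary entry.
toy_vocab = ["sketch", "solo", "anthro"]
toy_logits = [-2.0, 4.0, 3.0]

print([tag for tag, _ in select_tags(toy_logits, toy_vocab, topk=2)])
# ['solo', 'anthro']
print([tag for tag, _ in select_tags(toy_logits, toy_vocab, threshold=0.35)])
# ['solo', 'anthro']
```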

CLI

# top-30 tags, pretty output
python inference_tagger_standalone.py \
    --checkpoint tagger_proto.safetensors \
    --vocab tagger_vocab_with_categories.json \
    --images photo.jpg https://example.com/image.jpg \
    --topk 30

# comma-separated string (pipe into diffusion trainer)
python inference_tagger_standalone.py ... --format tags

# JSON
python inference_tagger_standalone.py ... --format json

Web UI

pip install fastapi uvicorn jinja2 aiofiles

python tagger_ui_server.py \
    --checkpoint tagger_proto.safetensors \
    --vocab tagger_vocab_with_categories.json \
    --port 7860
# → open http://localhost:7860

Files

| File | Description |
|---|---|
| *.safetensors | Model weights (bfloat16) |
| tagger_vocab_with_categories.json | {"idx2tag": [...]}: 74 625 tag strings ordered by training frequency |
| inference_tagger_standalone.py | Self-contained inference script (no transformers dep) |
| tagger_ui_server.py | FastAPI + Jinja2 web UI server |
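
The vocabulary file maps each output index to a tag string through its `idx2tag` list. Below is a small sketch of reading it with the standard library. `load_idx2tag` is a hypothetical helper, and any category metadata the file may also contain is ignored because its layout is not described here.

```python
import json
import tempfile
from pathlib import Path


def load_idx2tag(vocab_path):
    """Return the index-to-tag list stored under the "idx2tag" key."""
    with open(vocab_path, encoding="utf-8") as f:
        vocab = json.load(f)
    idx2tag = vocab["idx2tag"]
    if not isinstance(idx2tag, list):
        raise ValueError("expected 'idx2tag' to be a list of tag strings")
    return idx2tag


# Demo on a tiny stand-in file (the real vocabulary has 74 625 entries).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo_vocab.json"
    demo_path.write_text(
        json.dumps({"idx2tag": ["solo", "anthro", "sketch"]}),
        encoding="utf-8",
    )
    demo_tags = load_idx2tag(demo_path)

# Reverse lookup: tag string -> logit index.
tag2idx = {tag: i for i, tag in enumerate(demo_tags)}
print(demo_tags[1], tag2idx["sketch"])  # anthro 2
```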

Tag Vocabulary

Tags are sourced from e621 and Danbooru annotations and cover:

  • Subject: species, character count, gender (solo, duo, anthro, 1girl, male, …)
  • Body: anatomy, fur/scale/skin markings, body parts
  • Action / pose: looking at viewer, sitting, …
  • Scene: background, lighting, setting
  • Style: digital art, hi res, sketch, watercolor, …
  • Rating: explicit content tags are included; filter as needed for your use case
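
Filtering unwanted tags can be done after prediction with a simple blocklist. The sketch below assumes predictions arrive as `(tag, probability)` pairs, as in the usage example; `drop_blocked` and the placeholder tag names are hypothetical.

```python
def drop_blocked(predictions, blocked):
    """Remove predictions whose tag appears in `blocked` (case-insensitive)."""
    blocked_lower = {tag.lower() for tag in blocked}
    return [(tag, prob) for tag, prob in predictions
            if tag.lower() not in blocked_lower]


# Placeholder tag names; substitute the tags you want to exclude.
predictions = [("solo", 0.98), ("unwanted_tag_a", 0.91), ("sketch", 0.40)]
blocklist = ["unwanted_tag_a", "unwanted_tag_b"]

filtered = drop_blocked(predictions, blocklist)
print(filtered)  # [('solo', 0.98), ('sketch', 0.4)]
```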

Minimum tag frequency threshold: 50 occurrences across the combined dataset.
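
A frequency cut-off of this kind can be sketched with a counter over per-image tag lists. The code below is an illustration, not the training pipeline: `build_vocab` is a hypothetical name, descending-frequency order is an assumed reading of "ordered by training frequency", and ties are broken alphabetically as an arbitrary choice.

```python
from collections import Counter


def build_vocab(tag_lists, min_count=50):
    """Return tags seen on at least `min_count` images, most frequent first."""
    # set() so a tag repeated within one image is counted once.
    counts = Counter(tag for tags in tag_lists for tag in set(tags))
    kept = [(tag, n) for tag, n in counts.items() if n >= min_count]
    kept.sort(key=lambda item: (-item[1], item[0]))
    return [tag for tag, _ in kept]


# Toy dataset: "solo" on 70 images, "anthro" on 60, "sketch" on only 10.
posts = [["solo", "anthro"]] * 60 + [["solo", "sketch"]] * 10

print(build_vocab(posts, min_count=50))  # ['solo', 'anthro']
```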

Limitations

  • Evaluated on booru-style illustrations and furry art; performance on photographic images or other art styles is untested.
  • The vocabulary reflects the biases of e621 and Danbooru annotation practices.

License

Apache 2.0
