DINOv3 ViT-H/16+ Booru Tagger

A multi-label image tagger trained on e621 and Danbooru annotations, using a DINOv3 ViT-H/16+ backbone fine-tuned end-to-end with a single linear projection head.

Model Details

Property	Value
Backbone	`facebook/dinov3-vith16plus-pretrain-lvd1689m`
Architecture	ViT-H/16+ · 32 layers · hidden dim 1280 · 20 heads · SwiGLU MLP · RoPE · 4 register tokens
Head	`Linear((1 + 4) × 1280 → 74 625)` — CLS + 4 register tokens concatenated
Vocabulary	74 625 tags (min frequency ≥ 50 across training set)
Input resolution	Any multiple of 16 px — trained at 512 px, generalises to higher resolutions
Input normalisation	ImageNet mean/std `[0.485, 0.456, 0.406]` / `[0.229, 0.224, 0.225]`
Output	Raw logits — apply `sigmoid` for per-tag probabilities
Parameters	~840 M (backbone, 32 layers with a gated SwiGLU MLP) + ~480 M (head)
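
The head size and the token count at a given resolution follow directly from the figures in the table. The sketch below works through that arithmetic. The function names are illustrative and not part of the released scripts, the bias term is an assumption (linear layers usually carry one), and rounding down is only one possible way to reach a multiple of 16 px.

```python
PATCH = 16          # patch size of ViT-H/16+
HIDDEN = 1280       # hidden dim from the table
NUM_REGISTERS = 4   # register tokens
NUM_TAGS = 74_625   # vocabulary size


def head_parameter_count(bias: bool = True) -> int:
    """Weights (plus optional bias) of Linear((1 + 4) * 1280 -> 74625)."""
    in_features = (1 + NUM_REGISTERS) * HIDDEN  # CLS + registers, concatenated
    return in_features * NUM_TAGS + (NUM_TAGS if bias else 0)


def snap_to_patch_multiple(size_px: int) -> int:
    """Round a side length down to the nearest multiple of 16 (at least 16)."""
    return max(PATCH, (size_px // PATCH) * PATCH)


def sequence_length(height_px: int, width_px: int) -> int:
    """Patch tokens plus CLS and register tokens for a valid input size."""
    if height_px % PATCH or width_px % PATCH:
        raise ValueError("sides must be multiples of 16 px")
    patches = (height_px // PATCH) * (width_px // PATCH)
    return patches + 1 + NUM_REGISTERS


print(head_parameter_count())        # 477674625, i.e. roughly 480 M
print(sequence_length(512, 512))     # 1029 tokens at the training resolution
print(snap_to_patch_multiple(1000))  # 992
```

Because the patch count grows with the image area, doubling the side length to 1024 px roughly quadruples the sequence length, which is worth keeping in mind when running at higher resolutions.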

Training

Hyperparameter	Value
Training data	e621 + Danbooru (parquet)
Batch size	32
Learning rate	1e-6
Warmup steps	50
Loss	`BCEWithLogitsLoss` with per-tag `pos_weight = (neg/pos)^(1/T)`, cap 100
Optimiser	AdamW (β₁=0.9, β₂=0.999, wd=0.01)
Precision	bfloat16 (backbone) / float32 (projection + loss)
Hardware	2× GPU, ThreadPoolExecutor + NCCL all-reduce
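
The `pos_weight` rule in the table can be read as a tempered inverse-frequency weight. The sketch below is one interpretation, not the training code. It assumes `T` is a temperature-like exponent (the card does not give its value, so `T = 2.0` is purely illustrative), that `pos` and `neg` are per-tag counts of positive and negative examples, and that the cap of 100 is applied after exponentiation.

```python
def tempered_pos_weight(pos: int, neg: int, T: float = 2.0, cap: float = 100.0) -> float:
    """Return min((neg / pos) ** (1 / T), cap) for a single tag.

    T and its default are assumptions; the model card does not state the value.
    """
    if pos <= 0:
        return cap  # a tag with no positives gets the maximum weight
    return min((neg / pos) ** (1.0 / T), cap)


# A common tag: 100 000 positives in 1 000 000 images -> ratio 9 -> weight 3.0
print(tempered_pos_weight(pos=100_000, neg=900_000))

# A rare tag at the vocabulary cut-off: ratio 19 999 -> about 141.4 -> capped at 100.0
print(tempered_pos_weight(pos=50, neg=999_950))
```

In PyTorch, a vector of such per-tag weights would typically be passed as the `pos_weight` argument of `torch.nn.BCEWithLogitsLoss`. Larger `T` flattens the weights toward 1, while `T = 1` recovers the plain `neg/pos` ratio.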

Usage

Standalone (no `transformers` dependency)

from inference_tagger_standalone import Tagger

tagger = Tagger(
    checkpoint_path="tagger_proto.safetensors",
    vocab_path="tagger_vocab_with_categories.json",
    device="cuda",
)

tags = tagger.predict("photo.jpg", topk=40)
# → [("solo", 0.98), ("anthro", 0.95), ...]

# or threshold-based
tags = tagger.predict("https://example.com/image.jpg", threshold=0.35)
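
Since the model outputs raw logits, `topk` and `threshold` selection both amount to a sigmoid followed by a filter. The following sketch shows that post-processing on plain Python lists. `select_tags` and the toy vocabulary are hypothetical and do not mirror the internals of `inference_tagger_standalone.py`.

```python
import math


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def select_tags(logits, idx2tag, topk=None, threshold=None):
    """Turn raw logits into (tag, probability) pairs, highest probability first."""
    scored = sorted(
        ((idx2tag[i], sigmoid(z)) for i, z in enumerate(logits)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    if threshold is not None:
        scored = [(tag, p) for tag, p in scored if p >= threshold]
    if topk is not None:
        scored = scored[:topk]
    return scored


idx2tag = ["solo", "anthro", "sitting", "watercolor"]  # toy vocabulary
logits = [4.0, 3.0, -0.5, -3.0]                        # made-up model output

print(select_tags(logits, idx2tag, topk=2))            # two highest-scoring tags
print(select_tags(logits, idx2tag, threshold=0.35))    # every tag with p >= 0.35
```

Top-k always returns a fixed number of tags, whereas a threshold lets the tag count vary with how confident the model is on a given image.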

CLI

# top-30 tags, pretty output
python inference_tagger_standalone.py \
    --checkpoint tagger_proto.safetensors \
    --vocab tagger_vocab_with_categories.json \
    --images photo.jpg https://example.com/image.jpg \
    --topk 30

# comma-separated string (pipe into diffusion trainer)
python inference_tagger_standalone.py ... --format tags

# JSON
python inference_tagger_standalone.py ... --format json

Web UI

pip install fastapi uvicorn jinja2 aiofiles

python tagger_ui_server.py \
    --checkpoint tagger_proto.safetensors \
    --vocab tagger_vocab_with_categories.json \
    --port 7860
# → open http://localhost:7860

Files

File	Description
`*.safetensors`	Model weights (bfloat16)
`tagger_vocab_with_categories.json`	`{"idx2tag": [...]}` — 74 625 tag strings ordered by training frequency
`inference_tagger_standalone.py`	Self-contained inference script (no `transformers` dep)
`tagger_ui_server.py`	FastAPI + Jinja2 web UI server
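
The vocabulary file maps each logit index to its tag string. Below is a minimal sketch of reading it, assuming only the `idx2tag` key shown in the table. The file name suggests additional category fields, but their layout is not documented here, so they are ignored. A small in-memory stand-in replaces the real file to keep the example self-contained.

```python
import json

# Stand-in for the contents of tagger_vocab_with_categories.json;
# the real file holds 74 625 entries.
vocab_json = '{"idx2tag": ["solo", "anthro", "1girl"]}'

vocab = json.loads(vocab_json)
idx2tag = vocab["idx2tag"]                               # logit index -> tag
tag2idx = {tag: idx for idx, tag in enumerate(idx2tag)}  # inverse lookup

print(len(idx2tag))      # 3
print(idx2tag[1])        # anthro
print(tag2idx["1girl"])  # 2
```

With the real file, the same structure could be read via `json.load` on an opened file handle instead of `json.loads` on a string.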

Tag Vocabulary

Tags are sourced from e621 and Danbooru annotations and cover:

Subject — species, character count, gender (solo, duo, anthro, 1girl, male, …)
Body — anatomy, fur/scale/skin markings, body parts
Action / pose — looking at viewer, sitting, …
Scene — background, lighting, setting
Style — digital art, hi res, sketch, watercolor, …
Rating — explicit content tags are included; filter as needed for your use case

Minimum tag frequency threshold: 50 occurrences across the combined dataset.
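
Filtering for a particular use case can be done after prediction. The sketch below drops unwanted tags from a list of `(tag, probability)` pairs shaped like the output in the standalone usage example. `filter_tags` and the blocklist contents are hypothetical placeholders, not something shipped with the model.

```python
def filter_tags(predictions, blocklist):
    """Keep (tag, probability) pairs whose tag is not in the blocklist."""
    blocked = {tag.lower() for tag in blocklist}
    return [(tag, p) for tag, p in predictions if tag.lower() not in blocked]


predictions = [("solo", 0.98), ("anthro", 0.95), ("hi res", 0.61), ("sketch", 0.40)]
blocklist = {"hi res", "sketch"}  # e.g. drop style tags before captioning

kept = filter_tags(predictions, blocklist)
print(kept)                                # [('solo', 0.98), ('anthro', 0.95)]
print(", ".join(tag for tag, _ in kept))   # solo, anthro
```

The same pattern works for an allowlist by inverting the membership test.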

Limitations

Evaluated on booru-style illustrations and furry art; performance on photographic images or other art styles is untested.
The vocabulary reflects the biases of e621 and Danbooru annotation practices.

License

Apache 2.0
