# pii-modernbert-large-v2

PII / PHI named-entity recognition fine-tune of [answerdotai/ModernBERT-large](https://huggingface.co/answerdotai/ModernBERT-large) on the v2 harmonized + synthetic-augmented, English-only corpus at [Vrandan/pii-harmonized-corpus-v2](https://huggingface.co/datasets/Vrandan/pii-harmonized-corpus-v2).
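
The corpus should load with the standard `datasets` API (a minimal sketch, assuming the repo is public and uses a stock loader):

```python
from datasets import load_dataset

ds = load_dataset("Vrandan/pii-harmonized-corpus-v2")
print(ds)  # expect train/test splits; the numbers below are from the held-out test split
```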

## Eval (held-out test split)

| Metric | Value |
|---|---|
| F1 | 0.5975 |
| Precision | 0.5341 |
| Recall | 0.6780 |
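
For context, a minimal sketch of how NER precision/recall/F1 are commonly computed from tag sequences; whether v2 used `seqeval` with strict BILOU matching is an assumption, and the entity names below are illustrative:

```python
from seqeval.metrics import classification_report
from seqeval.scheme import BILOU

# Toy tag sequences in the BILOU scheme (U = unit-length entity).
y_true = [["O", "B-PERSON", "L-PERSON", "O", "U-PHONE_NUMBER"]]
y_pred = [["O", "B-PERSON", "L-PERSON", "O", "O"]]

# Strict mode counts an entity as correct only if the full span and type match.
print(classification_report(y_true, y_pred, mode="strict", scheme=BILOU))
```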

## Architecture vs v1

- 46 ML labels (was 49); 3 dropped to the regex layer (HTTP_COOKIE, MAC_ADDRESS, BLOOD_TYPE)
- BILOU tagging (was BIO): 1 + 4×46 = 185 labels (layout sketched after this list)
- English-only training data (v1's multilingual mix let non-English text leak in)
- Rebalanced spans: PERSON capped at 150K, rare classes floored to a minimum count
- Synthetic data blended in (Tier-A six-failure-mode + Tier-B malformation training)
- Class-weight cap at 1.5× mean (was 3×)
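
A minimal sketch of that label layout; `ENTITY_TYPES` stands in for the model's 46 entity names (see the real list in `config.json`):

```python
# BILOU layout: one "O" tag plus B/I/L/U variants per entity type.
ENTITY_TYPES = ["PERSON", "MEDICAL_RECORD_NUMBER"]  # illustrative; 46 types in the real config

LABELS = ["O"] + [f"{p}-{ent}" for ent in ENTITY_TYPES for p in ("B", "I", "L", "U")]
id2label = dict(enumerate(LABELS))
label2id = {label: i for i, label in id2label.items()}

# With the full entity list: len(LABELS) == 1 + 4 * 46 == 185
```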

## Recipe

- Context: native 8192 tokens (ModernBERT alternating attention)
- Optimizer: AdamW (fused on CUDA), cosine LR, peak 2e-05, warmup 0.1
- Effective batch: 32 (per-device 8 × grad-accum 4 × world 1)
- Precision: bf16, gradient checkpointing (use_reentrant=False), SDPA
- Loss: class-weighted CE over the rebalanced data, weights capped at 1.5× mean (sketched after this list)
- Epochs: 3, early-stop patience 4
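
The exact training script is not published here, so the following is a hedged reconstruction of the recipe in `transformers` terms; `WeightedTrainer`, `capped_weights`, and `class_weights` are illustrative names, and the weight computation is one plausible reading of "capped at 1.5× mean":

```python
import torch
from transformers import Trainer, TrainingArguments, EarlyStoppingCallback

args = TrainingArguments(
    output_dir="pii-modernbert-large-v2",
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,      # effective batch 32 on a single device
    num_train_epochs=3,
    bf16=True,
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": False},
    optim="adamw_torch_fused",          # fused AdamW on CUDA
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,        # required by EarlyStoppingCallback
    metric_for_best_model="f1",         # assumes compute_metrics reports "f1"
)

def capped_weights(label_counts: torch.Tensor, cap: float = 1.5) -> torch.Tensor:
    # One plausible reading: inverse-frequency weights clipped at cap x their mean.
    w = label_counts.sum() / (label_counts.clamp(min=1) * len(label_counts))
    return w.clamp(max=cap * w.mean())

class WeightedTrainer(Trainer):
    # Illustrative: cross-entropy with the capped per-class weights.
    def __init__(self, *a, class_weights=None, **kw):
        super().__init__(*a, **kw)
        self.class_weights = class_weights

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = torch.nn.CrossEntropyLoss(
            weight=self.class_weights.to(outputs.logits.device), ignore_index=-100
        )
        loss = loss_fct(outputs.logits.view(-1, model.config.num_labels), labels.view(-1))
        return (loss, outputs) if return_outputs else loss

# SDPA is selected at model load, e.g. from_pretrained(..., attn_implementation="sdpa").
# trainer = WeightedTrainer(..., args=args,
#                           callbacks=[EarlyStoppingCallback(early_stopping_patience=4)])
```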

## Inference

```python
from transformers import pipeline

pii = pipeline(
    "token-classification",
    model="Vrandan/pii-modernbert-large-v2",
    aggregation_strategy="simple",
)
pii("Patient John Smith, MRN-2024-88432, called 555-FAKE-1234 about Rx refill.")
```

## Hybrid pipeline note

This model is the ML half of a regex+ML hybrid. Its output should be merged with the regex layer covering the 3 dropped labels (HTTP_COOKIE, MAC_ADDRESS, BLOOD_TYPE), which the model is deliberately trained not to fire on.
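
A minimal merge sketch; the patterns below are illustrative stand-ins, not the production regex rules:

```python
import re
from transformers import pipeline

# Illustrative patterns only; the real regex layer is not published in this card.
REGEX_LAYER = {
    "MAC_ADDRESS": re.compile(r"\b(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}\b"),
    "BLOOD_TYPE": re.compile(r"\b(?:AB|A|B|O)[+-](?!\w)"),
    "HTTP_COOKIE": re.compile(r"(?i)\bSet-Cookie:\s*\S+|\bCookie:\s*\S+"),
}

pii = pipeline(
    "token-classification",
    model="Vrandan/pii-modernbert-large-v2",
    aggregation_strategy="simple",
)

def hybrid_detect(text: str) -> list[dict]:
    spans = list(pii(text))  # ML half: the 46 model labels
    for label, pattern in REGEX_LAYER.items():  # regex half: the 3 dropped labels
        for m in pattern.finditer(text):
            spans.append({"entity_group": label, "score": 1.0,
                          "word": m.group(), "start": m.start(), "end": m.end()})
    return sorted(spans, key=lambda s: s["start"])
```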