pii-modernbert-large-v2

A PII/PHI named-entity recognition model, fine-tuned from answerdotai/ModernBERT-large on the v2 harmonized, synthetically augmented, English-only corpus Vrandan/pii-harmonized-corpus-v2.

Eval (held-out test split)

Metric      Value
F1          0.5975
Precision   0.5341
Recall      0.6780
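
As a quick consistency check, the reported F1 can be recomputed from the reported precision and recall. This is only a sketch: it assumes the three figures are aggregate scores over the whole test split (micro-style averaging), which the card does not state, and the inputs are already rounded to four decimals.

```python
# Values copied from the evaluation table above.
precision = 0.5341
recall = 0.6780


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


f1 = f1_score(precision, recall)
print(round(f1, 4))  # 0.5975
```

The recomputed value agrees with the table to four decimals. Recall is noticeably higher than precision, so the model over-predicts entities more often than it misses them.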

Architecture vs v1

  • 46 ML labels (was 49): 3 moved to the regex layer (HTTP_COOKIE, MAC_ADDRESS, BLOOD_TYPE)
  • BILOU tagging (was BIO): 1 + 4×46 = 185 labels (O plus B/I/L/U for each of the 46 types)
  • English-only training data (v1 had multilingual text leaking in)
  • Rebalanced spans: PERSON capped at 150K, rare classes floored
  • Synthetic data blended (Tier-A six-failure-mode + Tier-B malformation training)
  • Class weight cap 1.5× mean (was 3×)
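
The label arithmetic and the BIO-to-BILOU switch listed above can be sketched as follows. The entity type names here are placeholders (the card does not list the 46 real labels), and `build_bilou_labels` and `bio_to_bilou` are hypothetical helpers, not part of the released code.

```python
# Placeholder entity types: the card does not enumerate the 46 real labels.
ENTITY_TYPES = [f"TYPE_{i:02d}" for i in range(46)]


def build_bilou_labels(entity_types):
    """Return 'O' plus B-/I-/L-/U- variants of every entity type."""
    labels = ["O"]
    for etype in entity_types:
        labels.extend(f"{prefix}-{etype}" for prefix in ("B", "I", "L", "U"))
    return labels


def bio_to_bilou(tags):
    """Convert a well-formed BIO tag sequence to BILOU."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, etype = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == f"I-{etype}"  # does the same entity go on?
        if prefix == "B":
            # Single-token entities become U-, multi-token ones keep B-.
            out.append(f"B-{etype}" if continues else f"U-{etype}")
        else:
            # The final token of a multi-token entity becomes L-.
            out.append(f"I-{etype}" if continues else f"L-{etype}")
    return out


labels = build_bilou_labels(ENTITY_TYPES)
print(len(labels))  # 185
print(bio_to_bilou(["B-PERSON", "I-PERSON", "O", "B-PERSON"]))
# ['B-PERSON', 'L-PERSON', 'O', 'U-PERSON']
```

Compared with BIO, BILOU marks entity ends and single-token entities explicitly, which is why the label count grows from 1 + 2×N to 1 + 4×N.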

Recipe

  • Context: native 8192 tokens (ModernBERT alternating attention)
  • Optimizer: AdamW (fused on CUDA), cosine LR, peak 2e-05, warmup 0.1
  • Effective batch: 32 (per-device 8 × grad-accum 4 × world 1)
  • Precision: bf16, gradient checkpointing (use_reentrant=False), SDPA
  • Loss: class-weighted CE, capped at 1.5× mean (rebalanced data)
  • Epochs: 3, early-stop patience 4
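
A few of the recipe numbers can be made concrete with a small sketch. It rests on several assumptions that the card does not confirm: that "warmup 0.1" is a warmup ratio with linear ramp-up, that the cosine schedule decays to zero, and that class weights are inverse-frequency weights normalised to mean 1 before the 1.5× cap is applied. The function names are hypothetical.

```python
import math

PEAK_LR = 2e-5
WARMUP_RATIO = 0.1  # assumption: fraction of total steps used for warmup

# Effective batch = per-device batch x gradient accumulation x world size.
effective_batch = 8 * 4 * 1


def lr_at(step, total_steps, peak=PEAK_LR, warmup_ratio=WARMUP_RATIO):
    """Linear warmup to the peak LR, then cosine decay to zero (assumed)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))


def capped_class_weights(counts, cap=1.5):
    """Inverse-frequency weights scaled to mean 1, then clipped at `cap`.

    Assumed weighting scheme; clipping means the final mean is below 1.
    """
    inverse = [1.0 / c for c in counts]
    mean = sum(inverse) / len(inverse)
    return [min(w / mean, cap) for w in inverse]


print(effective_batch)                        # 32
print(lr_at(100, 1000))                       # peak LR at the end of warmup
print(capped_class_weights([1000, 100, 10]))  # rarest class is clipped to 1.5
```

A lower cap (1.5× instead of 3×) limits how strongly rare classes are up-weighted, which fits a corpus whose spans were already rebalanced.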

Inference

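from transformers import pipeline

# Load the fine-tuned checkpoint as a token-classification (NER) pipeline.
# The repository is gated: accept its conditions and authenticate first.
pii = pipeline(
    "token-classification",
    model="Vrandan/pii-modernbert-large-v2",
    aggregation_strategy="simple",  # merge sub-word tokens into entity spans
)

# Returns a list of dicts with entity_group, score, word, start, and end.
entities = pii("Patient John Smith, MRN-2024-88432, called 555-FAKE-1234 about Rx refill.")
print(entities)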
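
One common way to consume the pipeline output is to replace each detected span with a placeholder. The sketch below is not part of the released code: `redact` is a hypothetical helper, and the entity list in the example is hand-written rather than real model output. Because the label set is BILOU, `entity_group` values may still carry a prefix such as `L-` or `U-` depending on how the pipeline groups tags, so the helper strips one defensively.

```python
import re

# Matches a leading BILOU prefix such as "U-" or "L-".
_BILOU_PREFIX = re.compile(r"^[BILU]-")


def redact(text, entities):
    """Replace each entity span in `text` with a [LABEL] placeholder.

    `entities` is a list of dicts with `entity_group`, `start`, and `end`
    keys, as produced by an aggregated token-classification pipeline.
    """
    pieces = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e["start"]):
        if ent["start"] < cursor:
            continue  # skip spans that overlap one already redacted
        label = _BILOU_PREFIX.sub("", ent["entity_group"])
        pieces.append(text[cursor:ent["start"]])
        pieces.append(f"[{label}]")
        cursor = ent["end"]
    pieces.append(text[cursor:])
    return "".join(pieces)


# Hand-written example span, not real model output.
sample = "Patient John Smith called."
start = sample.index("John Smith")
fake_entities = [
    {"entity_group": "U-PERSON", "start": start, "end": start + len("John Smith")}
]
print(redact(sample, fake_entities))  # Patient [PERSON] called.
```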

Hybrid pipeline note

This model is the ML half of a regex + ML hybrid. Merge its inference output with the regex layer, which covers the 3 dropped labels (HTTP_COOKIE, MAC_ADDRESS, BLOOD_TYPE); the model is trained not to fire on them.
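
A minimal sketch of such a merge is shown below. The project's actual regex layer is not included in this card, so the three patterns, the `regex_spans` and `merge_spans` helpers, the fixed score of 1.0, and the "regex wins on overlap" rule are all illustrative assumptions.

```python
import re

# Hypothetical patterns for the three labels handled outside the model.
REGEX_LABELS = {
    "MAC_ADDRESS": re.compile(r"\b(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}\b"),
    "BLOOD_TYPE": re.compile(r"\b(?:AB|A|B|O)[+-](?!\w)"),
    "HTTP_COOKIE": re.compile(r"\b(?:Set-)?Cookie:[^\r\n]+", re.IGNORECASE),
}


def regex_spans(text):
    """Return regex matches in the same dict shape as the ML pipeline output."""
    spans = []
    for label, pattern in REGEX_LABELS.items():
        for match in pattern.finditer(text):
            spans.append({
                "entity_group": label,
                "start": match.start(),
                "end": match.end(),
                "word": match.group(0),
                "score": 1.0,  # assumed: regex hits are treated as certain
            })
    return sorted(spans, key=lambda s: s["start"])


def merge_spans(ml_spans, rx_spans):
    """Keep every regex span, plus ML spans that do not overlap any of them."""
    def overlaps(a, b):
        return a["start"] < b["end"] and b["start"] < a["end"]

    kept = [s for s in ml_spans if not any(overlaps(s, r) for r in rx_spans)]
    return sorted(kept + rx_spans, key=lambda s: s["start"])


text = "John uses NIC 00:1A:2B:3C:4D:5E, blood type O+."
# Hand-written stand-in for the model output on `text`.
ml_spans = [{"entity_group": "PERSON", "start": 0, "end": 4, "word": "John", "score": 0.9}]
merged = merge_spans(ml_spans, regex_spans(text))
print([s["entity_group"] for s in merged])
# ['PERSON', 'MAC_ADDRESS', 'BLOOD_TYPE']
```

Giving regex spans precedence is one reasonable choice here, since these three labels have rigid surface forms; a production merge might instead compare scores or span lengths.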

Model size: 0.4B params (Safetensors, tensor type BF16)