# Model Card for COMI-LINGUA-NER

## Model Description

This is a fine-tuned version of aya-expanse-8b for Named Entity Recognition (NER) on Hinglish (Hindi-English code-mixed) text. It performs token-level entity tagging (PERSON, ORGANISATION, LOCATION, DATE, TIME, GPE, HASHTAG, EMOJI, MENTION, X/Other) over both Roman and Devanagari scripts. The model achieves 94.90 macro F1 on the COMI-LINGUA NER test set (5K instances), compared with 59.88 F1 for zero-shot inference with the base model.
- Model type: LoRA-adapted Transformer LLM (8B params, ~32M trainable)
- License: apache-2.0
- Finetuned from model: CohereForAI/aya-expanse-8b
## Model Sources
- Paper: COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-Mixing
- Demo: Integrated into the Demo Portal
## Uses
- NER in Hinglish pipelines (e.g., social media monitoring, news extraction).
- Example inference prompt and expected output (a runnable sketch follows below):

  Identify named entities in: "लंदन के Madame Tussauds में Deepika Padukone के wax statue का गुरुवार को अनावरण हुआ।"
  Output: [{'लंदन': 'GPE'}, {'के': 'X'}, {'Madame': 'ORGANISATION'}, {'Tussauds': 'ORGANISATION'}, {'में': 'X'}, {'Deepika': 'PERSON'}, {'Padukone': 'PERSON'}, {'के': 'X'}, {'wax': 'X'}, {'statue': 'X'}, {'का': 'X'}, {'गुरुवार': 'DATE'}, {'को': 'X'}, {'अनावरण': 'X'}, {'हुआ।': 'X'}]
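A minimal inference sketch using 🤗 Transformers and PEFT is below. The repository IDs come from this card; the dtype, device placement, and generation settings are illustrative assumptions, not documented defaults:

```python
# Minimal inference sketch. Repository IDs are from this card; dtype,
# device placement, and generation settings are assumptions.
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

BASE_ID = "CohereForAI/aya-expanse-8b"
ADAPTER_ID = "LingoIITGN/COMI-LINGUA-NER"

tokenizer = AutoTokenizer.from_pretrained(BASE_ID)
model = AutoModelForCausalLM.from_pretrained(
    BASE_ID, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(model, ADAPTER_ID)  # attach the LoRA adapter

prompt = (
    'Identify named entities in: "लंदन के Madame Tussauds में Deepika Padukone '
    'के wax statue का गुरुवार को अनावरण हुआ।"'
)
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```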
## Training Details

### Training Data

The NER annotations of the COMI-LINGUA dataset (see the paper above).

### Training Procedure

#### Preprocessing

Inputs were tokenized with the base model's tokenizer and formatted with instruction templates plus few-shot examples. Instances were filtered to those with at least 5 tokens, excluding hate speech and non-Hinglish text; a sketch of this filter follows.
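In the sketch below, `is_hate_speech` and `is_hinglish` are hypothetical placeholder predicates; the card does not name the actual checks used during dataset construction:

```python
# Hypothetical preprocessing filter matching the description above.
# is_hate_speech / is_hinglish are placeholder predicates, not the
# actual classifiers used during dataset construction.
def is_hate_speech(text: str) -> bool:
    return False  # placeholder: plug in a real hate-speech classifier


def is_hinglish(text: str) -> bool:
    return True  # placeholder: plug in a real code-mixing/language detector


def keep_instance(tokens: list[str], text: str) -> bool:
    """Keep instances with >= 5 tokens that are Hinglish and not hate speech."""
    return len(tokens) >= 5 and not is_hate_speech(text) and is_hinglish(text)
```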
#### Training Hyperparameters

- Regime: PEFT LoRA (rank = 32, alpha = 64, dropout = 0.1); see the configuration sketch after this list
- Epochs: 3
- Batch size: 4 per device (gradient accumulation = 8, effective batch size = 32)
- Learning rate: 2e-4 (cosine schedule, warmup ratio = 0.1)
- Weight decay: 0.01
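A configuration sketch matching these values, using `peft` and `transformers`; any argument not stated on this card (such as `output_dir`) is an assumed placeholder:

```python
from peft import LoraConfig
from transformers import TrainingArguments

# LoRA settings as listed above.
lora_config = LoraConfig(
    r=32,
    lora_alpha=64,
    lora_dropout=0.1,
    task_type="CAUSAL_LM",
)

# Trainer settings as listed above; output_dir is an assumed placeholder.
training_args = TrainingArguments(
    output_dir="comi-lingua-ner-lora",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # effective batch size = 4 * 8 = 32
    learning_rate=2e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    weight_decay=0.01,
)
```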
## Evaluation

### Testing Data

COMI-LINGUA NER test set (5K instances).

### Metrics

Macro-averaged precision, recall, and F1, computed at the token level (a scoring sketch follows).
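For reference, token-level macro scores can be computed as in the sketch below (scikit-learn; the tag sequences shown are toy examples, not dataset instances):

```python
from sklearn.metrics import precision_recall_fscore_support

# gold and pred are flat, token-aligned tag sequences (toy examples).
gold = ["GPE", "X", "ORGANISATION", "PERSON", "X"]
pred = ["GPE", "X", "ORGANISATION", "X", "X"]

p, r, f1, _ = precision_recall_fscore_support(
    gold, pred, average="macro", zero_division=0
)
print(f"P={p:.4f}  R={r:.4f}  F1={f1:.4f}")
```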
### Results

| Setting | Precision | Recall | F1 |
|---|---|---|---|
| Zero-shot | 54.47 | 68.27 | 59.88 |
| One-shot | 79.73 | 81.44 | 79.18 |
| Fine-tuned | 94.94 | 94.91 | 94.90 |
Summary: The fine-tuned aya-expanse-8b sets the state of the art for Hinglish NER, reaching 94.90 macro F1.
## Bias, Risks, and Limitations
This model is a research preview and is subject to ongoing iterative updates. As such, it provides only limited safety measures.
## Model Card Contact

Lingo Research Group at IIT Gandhinagar, India
Email: lingo@iitgn.ac.in
## Citation
If you use this model, please cite the following work:
@inproceedings{sheth-etal-2025-comi,
title = "{COMI}-{LINGUA}: Expert Annotated Large-Scale Dataset for Multitask {NLP} in {H}indi-{E}nglish Code-Mixing",
author = "Sheth, Rajvee and
Beniwal, Himanshu and
Singh, Mayank",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.findings-emnlp.422/",
pages = "7973--7992",
ISBN = "979-8-89176-335-7",
}