---
base_model: CohereForAI/aya-expanse-8b
library_name: peft
license: apache-2.0
datasets:
- LingoIITGN/COMI-LINGUA
language:
- hi
- en
pipeline_tag: token-classification
tags:
- code-mixing
- Hinglish
metrics:
- precision
- recall
- f1
---

# Model Card for Model ID

### Model Description

This model is a LoRA fine-tune of aya-expanse-8b for Named Entity Recognition (NER) on Hinglish (Hindi-English code-mixed) text. It assigns each token one of ten tags (PERSON, ORGANISATION, LOCATION, DATE, TIME, GPE, HASHTAG, EMOJI, MENTION, X/Other) and handles text written in Roman or Devanagari script.
It achieves 94.90 F1 on the COMI-LINGUA test set (5K instances), compared with 59.88 F1 for zero-shot inference.

- **Model type:** LoRA-adapted Transformer LLM (8B params, ~32M trainable)
- **License:** apache-2.0
- **Finetuned from model:** CohereForAI/aya-expanse-8b

### Model Sources
- **Paper:** [COMI-LINGUA: Expert Annotated Large-Scale Dataset for Multitask NLP in Hindi-English Code-Mixing](https://aclanthology.org/2025.findings-emnlp.422.pdf)
- **Demo:** Integrated in [Demo Portal](https://lingo.iitgn.ac.in/comi-lingua/) 

## Uses

- NER in Hinglish pipelines (e.g., social media monitoring, news extraction).
- Example inference prompt:  
  ```
  Identify named entities in: "लंदन के Madame Tussauds में Deepika Padukone के wax statue का गुरुवार को अनावरण हुआ।"
  Output: [{'लंदन': 'GPE'}, {'के': 'X'}, {'Madame': 'ORGANISATION'}, {'Tussauds': 'ORGANISATION'}, {'में': 'X'}, {'Deepika': 'PERSON'}, {'Padukone': 'PERSON'}, {'के': 'X'}, {'wax': 'X'}, {'statue': 'X'}, {'का': 'X'}, {'गुरुवार': 'DATE'}, {'को': 'X'}, {'अनावरण': 'X'}, {'हुआ।': 'X'}]
  ```
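
The output format in the example above is a Python-style list of single-entry dictionaries, one per token. The following is a minimal sketch of how such a string could be turned into `(token, tag)` pairs, assuming the model returns a well-formed list. The helper names `parse_ner_output` and `entities_only` are hypothetical and not part of the released model or dataset.

```python
import ast


def parse_ner_output(raw: str) -> list[tuple[str, str]]:
    """Parse a string such as "[{'लंदन': 'GPE'}, {'के': 'X'}]" into (token, tag) pairs.

    Assumption: the model emits a Python-literal list of one-entry dicts.
    Anything else (e.g. a set like {'के', 'X'}) is rejected with ValueError.
    """
    items = ast.literal_eval(raw.strip())
    pairs = []
    for item in items:
        # Each element is expected to map exactly one token to one tag.
        if not isinstance(item, dict) or len(item) != 1:
            raise ValueError(f"unexpected element: {item!r}")
        (token, tag), = item.items()
        pairs.append((token, tag))
    return pairs


def entities_only(pairs: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Keep only tokens whose tag is not X (the 'other' class)."""
    return [(token, tag) for token, tag in pairs if tag != "X"]
```

Because generated text is not guaranteed to be well formed, callers would likely want to catch `ValueError` and `SyntaxError` around `parse_ner_output` and fall back to tagging every token as `X`.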

## Training Details
### Training Data
[COMI-LINGUA Dataset Card](https://huggingface.co/datasets/LingoIITGN/COMI-LINGUA).

### Training Procedure
#### Preprocessing
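Inputs were tokenized with the base model's tokenizer and wrapped in instruction templates with few-shot examples. Sentences were kept only if they had at least 5 tokens and contained no hate speech or non-Hinglish content.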
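
As a rough illustration of the preprocessing described above, the sketch below applies a minimum-length filter and assembles an instruction prompt with optional few-shot examples. It is not the authors' pipeline: the template wording is borrowed from the inference example in this card, whitespace token counting is an assumption, and the function names `has_min_tokens` and `build_prompt` are hypothetical. The hate-speech and language filters are not shown.

```python
def has_min_tokens(sentence: str, min_tokens: int = 5) -> bool:
    """Return True if the sentence has at least `min_tokens` tokens.

    Assumption: tokens are whitespace-separated; the card does not say
    how tokens were counted for the length filter.
    """
    return len(sentence.split()) >= min_tokens


def build_prompt(sentence: str, examples=()) -> str:
    """Build an instruction prompt, optionally preceded by few-shot examples.

    `examples` is an iterable of (sentence, tags) pairs, where `tags` is a
    list of one-entry dicts such as [{'Delhi': 'GPE'}].
    """
    parts = []
    for ex_sentence, ex_tags in examples:
        # Each demonstration shows the instruction followed by its gold output.
        parts.append(f'Identify named entities in: "{ex_sentence}"\nOutput: {ex_tags}')
    # The final block leaves the output empty for the model to complete.
    parts.append(f'Identify named entities in: "{sentence}"\nOutput:')
    return "\n\n".join(parts)
```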

#### Training Hyperparameters
- **Regime:** PEFT LoRA (rank=32, alpha=64, dropout=0.1)
- **Epochs:** 3
- **Batch:** 4 (accum=8, effective=32)
- **LR:** 2e-4 (cosine+warmup=0.1)
- **Weight decay:** 0.01
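
To make the numbers above concrete, here is a small sketch that derives the effective batch size and the LoRA scaling factor, and evaluates a cosine learning-rate schedule with linear warmup. It assumes that `warmup=0.1` means a warmup ratio of 10% of total steps, that the schedule decays to zero over a single half-cosine, and that the standard LoRA scaling `alpha / rank` applies. The function name `lr_at_step` is hypothetical, and the training script itself may differ.

```python
import math

PER_DEVICE_BATCH = 4
GRAD_ACCUM_STEPS = 8
# Gradients are accumulated over 8 micro-batches of 4 before each optimizer step.
EFFECTIVE_BATCH = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS  # 32

LORA_RANK = 32
LORA_ALPHA = 64
# Standard LoRA scales the low-rank update by alpha / rank.
LORA_SCALING = LORA_ALPHA / LORA_RANK  # 2.0


def lr_at_step(step: int, total_steps: int,
               peak_lr: float = 2e-4, warmup_ratio: float = 0.1) -> float:
    """Learning rate at `step` under linear warmup followed by cosine decay.

    Assumption: warmup covers the first `warmup_ratio` fraction of steps and
    the rate decays from `peak_lr` to 0 over the remaining steps.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With 1,000 optimizer steps, for example, this schedule would ramp up over the first 100 steps, peak at 2e-4, and fall to half of that at step 550.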

## Evaluation

#### Testing Data
COMI-LINGUA NER test set (5K instances).

#### Metrics
Macro-averaged precision (P), recall (R), and F1, computed at the token level.
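
For readers unfamiliar with macro averaging, the sketch below computes per-label precision, recall, and F1 over aligned token tags and then averages each metric across labels with equal weight. It is an illustrative implementation, not the authors' evaluation script: whether the `X` class is included in the average and how undefined ratios are handled are assumptions here (all labels are included, and undefined ratios count as 0). The name `macro_prf` is hypothetical.

```python
from collections import Counter


def macro_prf(gold: list[str], pred: list[str]) -> tuple[float, float, float]:
    """Macro-averaged (precision, recall, F1) over token-level tags.

    `gold` and `pred` are aligned tag sequences, one tag per token.
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")

    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p where it was wrong
            fn[g] += 1  # missed the gold label g

    labels = sorted(set(gold) | set(pred))
    precisions, recalls, f1s = [], [], []
    for label in labels:
        predicted = tp[label] + fp[label]
        actual = tp[label] + fn[label]
        p = tp[label] / predicted if predicted else 0.0
        r = tp[label] / actual if actual else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n = len(labels)
    # Every label contributes equally, regardless of how many tokens carry it.
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n
```

Because macro F1 is the mean of per-label F1 scores rather than the harmonic mean of macro precision and macro recall, it can fall below both of them, as in the one-shot row of the results table.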

### Results
| Setting | P    | R    | F1   |
|---------|------|------|------|
| Zero-shot | 54.47 | 68.27 | 59.88 |
| One-shot | 79.73 | 81.44 | 79.18 |
| **Fine-tuned** | **94.94** | **94.91** | **94.90** |

**Summary:** State of the art for Hinglish NER: the fine-tuned aya-expanse-8b reaches 94.90 F1, versus 59.88 zero-shot and 79.18 one-shot.

## Bias, Risks, and Limitations

<span style="color:red"> This model is a research preview and is subject to ongoing iterative updates. As such, it provides only limited safety measures.</span>


## Model Card Contact

[Lingo Research Group at IIT Gandhinagar, India](https://labs.iitgn.ac.in/lingo/) <br>
Email: [lingo@iitgn.ac.in](mailto:lingo@iitgn.ac.in)

## Citation

If you use this model, please cite the following work:
```
@inproceedings{sheth-etal-2025-comi,
    title = "{COMI}-{LINGUA}: Expert Annotated Large-Scale Dataset for Multitask {NLP} in {H}indi-{E}nglish Code-Mixing",
    author = "Sheth, Rajvee  and
      Beniwal, Himanshu  and
      Singh, Mayank",
    editor = "Christodoulopoulos, Christos  and
      Chakraborty, Tanmoy  and
      Rose, Carolyn  and
      Peng, Violet",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.422/",
    pages = "7973--7992",
    ISBN = "979-8-89176-335-7",
}
```