SpeD-Ro_6gram-tokens-prune0135 is a Romanian token-based 6-gram language model built with KenLM and integrated within the NVIDIA NeMo framework. It was trained using this script on approximately 27 million lines of text and is intended as a compact, efficient statistical language model for Romanian, particularly suited to ASR decoding pipelines and text probability estimation.

The 6-gram model is pruned to keep it practical. Many high-order n-grams occur only a handful of times, adding considerable size for little benefit. The [0, 1, 3, 5] pruning scheme (the "prune0135" in the model name) removes these rare n-grams and keeps those seen often enough to be useful, which makes the model much smaller while preserving its core behavior.
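To make the pruning scheme concrete, here is a minimal sketch of count-threshold pruning in plain Python. It assumes the [0, 1, 3, 5] values follow the convention of KenLM's `lmplz --prune` option: one threshold per n-gram order, an n-gram is dropped when its count is less than or equal to the threshold for its order, 0 means no pruning, and the last threshold extends to all higher orders (so 4-, 5- and 6-grams would all use 5). The model card does not state this mapping explicitly, so treat it as an assumption. The function names and the toy corpus are hypothetical, and the sketch ignores sentence-boundary markers and smoothing, both of which KenLM handles internally.

```python
from collections import Counter


def count_ngrams(lines, max_order):
    """Count every n-gram of order 1..max_order over whitespace-split lines."""
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for line in lines:
        tokens = line.split()
        for n in range(1, max_order + 1):
            for i in range(len(tokens) - n + 1):
                counts[n][tuple(tokens[i:i + n])] += 1
    return counts


def expand_thresholds(thresholds, max_order):
    """Extend the last threshold to any higher orders not listed explicitly."""
    missing = max_order - len(thresholds)
    return list(thresholds) + [thresholds[-1]] * missing


def prune(counts, thresholds):
    """Keep only n-grams whose count is strictly above their order's threshold."""
    per_order = expand_thresholds(thresholds, max(counts))
    return {
        n: Counter({gram: c for gram, c in grams.items() if c > per_order[n - 1]})
        for n, grams in counts.items()
    }


# Toy corpus (hypothetical): one frequent sentence, one rarer one, one singleton.
corpus = ["a b c d e f"] * 6 + ["a b x"] * 2 + ["q r"]

counts = count_ngrams(corpus, max_order=6)
pruned = prune(counts, thresholds=[0, 1, 3, 5])

for order in sorted(counts):
    print(order, len(counts[order]), "->", len(pruned[order]))
```

In this toy run all unigrams survive, the singleton bigram `q r` is removed, and the trigram `a b x` (seen twice) is removed, while every n-gram of the sentence repeated six times is kept at all orders.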

📄 Citation

If you use this model, please cite:

@misc{pirlogeanu2025opensourcestateoftheartsolution,
      title={Open Source State-Of-the-Art Solution for Romanian Speech Recognition},
      author={Gabriel Pirlogeanu and Alexandru-Lucian Georgescu and Horia Cucu},
      year={2025},
      eprint={2511.03361},
      archivePrefix={arXiv},
      primaryClass={eess.AS},
      url={https://arxiv.org/abs/2511.03361},
}

Also consider citing the original NVIDIA NeMo framework and KenLM:

@article{kuchaiev2019nemo,
  title={NeMo: a toolkit for building AI applications using Neural Modules},
  author={Kuchaiev, Oleksii and Ginsburg, Boris and others},
  journal={arXiv preprint arXiv:1909.09577},
  year={2019}
}

@inproceedings{heafield-2011-kenlm,
    title = "{K}en{LM}: Faster and Smaller Language Model Queries",
    author = "Heafield, Kenneth",
    editor = "Callison-Burch, Chris  and
      Koehn, Philipp  and
      Monz, Christof  and
      Zaidan, Omar F.",
    booktitle = "Proceedings of the Sixth Workshop on Statistical Machine Translation",
    month = jul,
    year = "2011",
    address = "Edinburgh, Scotland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/W11-2123/",
    pages = "187--197"
}

Contact

For questions or collaborations: gabriel.pirlogeanu@gmail.com

license: apache-2.0