SpeD-Ro_6gram-tokens-prune0135 is a Romanian token-based 6-gram language model built with KenLM and integrated into the NVIDIA NeMo framework. It was trained using this script on approximately 27 million lines of text and is intended as an efficient, compact statistical language model for Romanian, particularly suited to ASR decoding pipelines and text probability estimation.

We pruned the 6-gram model to keep it practical. Many high-order n-grams occur only a handful of times and add a lot of size for little benefit. The [0, 1, 3, 5] pruning scheme removes these rare n-grams and keeps those seen often enough to be useful, making the model much smaller while preserving its core behavior.
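To make the pruning scheme concrete, here is a minimal pure-Python sketch of count-threshold pruning. It is not the training script used for this model and does not call KenLM. It assumes the usual KenLM convention that an n-gram of order k is dropped when its count is less than or equal to the k-th threshold, and that the last threshold applies to all remaining orders, so [0, 1, 3, 5] on a 6-gram model behaves like [0, 1, 3, 5, 5, 5]. The function names (`expand_thresholds`, `count_ngrams`, `prune`) are hypothetical, and smoothing and sentence-boundary handling are omitted.

```python
from collections import Counter


def expand_thresholds(thresholds, order):
    """Pad the threshold list so the last value covers all remaining orders."""
    return list(thresholds) + [thresholds[-1]] * (order - len(thresholds))


def count_ngrams(lines, order):
    """Count all n-grams of length 1..order over whitespace-separated tokens."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for n in range(1, order + 1):
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def prune(counts, thresholds, order):
    """Keep only n-grams whose count is strictly above their order's threshold."""
    per_order = expand_thresholds(thresholds, order)
    return {
        ngram: count
        for ngram, count in counts.items()
        if count > per_order[len(ngram) - 1]
    }


if __name__ == "__main__":
    # Toy corpus for illustration only; the real model uses ~27 million lines.
    corpus = ["a b", "a b", "a c"]
    kept = prune(count_ngrams(corpus, order=2), thresholds=[0, 1], order=2)
    for ngram, count in sorted(kept.items()):
        print(" ".join(ngram), count)
```

With a threshold of 0 for unigrams, every observed unigram survives, which keeps the vocabulary intact; only the higher orders are thinned out.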
Citation
If you use this model, please cite:
@misc{pirlogeanu2025opensourcestateoftheartsolution,
title={Open Source State-Of-the-Art Solution for Romanian Speech Recognition},
author={Gabriel Pirlogeanu and Alexandru-Lucian Georgescu and Horia Cucu},
year={2025},
eprint={2511.03361},
archivePrefix={arXiv},
primaryClass={eess.AS},
url={https://arxiv.org/abs/2511.03361},
}
Also consider citing the original NVIDIA NeMo framework and KenLM:
@article{kuchaiev2019nemo,
title={NeMo: a toolkit for building AI applications using Neural Modules},
author={Kuchaiev, Oleksii and Ginsburg, Boris and others},
journal={arXiv preprint arXiv:1909.09577},
year={2019}
}
@inproceedings{heafield-2011-kenlm,
title = "{K}en{LM}: Faster and Smaller Language Model Queries",
author = "Heafield, Kenneth",
editor = "Callison-Burch, Chris and
Koehn, Philipp and
Monz, Christof and
Zaidan, Omar F.",
booktitle = "Proceedings of the Sixth Workshop on Statistical Machine Translation",
month = jul,
year = "2011",
address = "Edinburgh, Scotland",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/W11-2123/",
pages = "187--197"
}
Contact
For questions or collaborations: gabriel.pirlogeanu@gmail.com
License: Apache-2.0