VoxCPM-1.5B-HI-LORA
The TTS model was trained on the Hindi subset of the ai4bharat/indicvoices_r dataset, using audio clips between 4-12 s after resampling to 44.1 kHz. I tried to keep the male/female audio split at about 50% each (roughly 32 hours in total). I think more data is required for better accuracy. Voice cloning may feel a bit less accurate if you generate longer audio sequences, which is expected since the training clips are at most 12 s long. The checkpoint is from iteration 8000 (I was not seeing much improvement and, since this is just an experiment, didn't go further).
Device: NVIDIA RTX4070 Super 12GB
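The 4-12 s clip selection described above can be sketched as a simple duration filter. This is a minimal illustration, assuming each example carries a raw sample array and its sampling rate (mirroring the typical audio feature layout of Hugging Face datasets); it is not the actual training pipeline.

```python
# Sketch of the 4-12 s duration filter used to select training clips.
# The example dict layout below is an assumption, not the exact pipeline.

TARGET_SR = 44_100  # clips are resampled to 44.1 kHz before training

def duration_s(example):
    """Clip duration in seconds from the raw sample count."""
    return len(example["audio"]["array"]) / example["audio"]["sampling_rate"]

def keep(example, lo=4.0, hi=12.0):
    """True if the clip falls inside the 4-12 s training window."""
    return lo <= duration_s(example) <= hi

# Tiny synthetic examples: 2 s, 8 s, and 15 s of silence at 16 kHz.
examples = [
    {"audio": {"array": [0.0] * (16_000 * n), "sampling_rate": 16_000}}
    for n in (2, 8, 15)
]
kept = [ex for ex in examples if keep(ex)]
print(len(kept))  # → 1 (only the 8 s clip survives)
```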
How-tos
Finetune a LoRA:
Follow the original guide here
We also trained the text embedding layers, in addition to the LoRA layers covered in that guide. Full finetuning should generally give better results.
Inference:
Setup:
git clone https://github.com/OpenBMB/VoxCPM.git
cd VoxCPM
# recommended to use a virtual environment
pip install voxcpm
# for voice cloning, I had to install torchcodec
pip install torchcodec==0.9
Make sure torchcodec is compatible with your PyTorch and Python versions. You can check compatibility here.
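Before running inference, a quick sanity check of what is actually installed can save a confusing import error later. This is just a convenience sketch using the standard library; the module names are the ones installed above.

```python
# Report installed versions of the packages the inference script needs.
import importlib

def report_versions(mods=("torch", "torchcodec")):
    """Return one line per module: its version, 'unknown', or 'missing'."""
    lines = []
    for name in mods:
        try:
            mod = importlib.import_module(name)
            lines.append(f"{name}: {getattr(mod, '__version__', 'unknown')}")
        except ImportError:
            lines.append(f"{name}: missing")
    return "\n".join(lines)

print(report_versions())
```

If either line reads `missing`, revisit the `pip install` steps; if the torch and torchcodec versions don't match the compatibility table, pin a different `torchcodec` release.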
Download the checkpoints; I recommend using a few lines of Python:
from huggingface_hub import snapshot_download
snapshot_download("openbmb/VoxCPM1.5", local_dir="./pretrained/VoxCPM-1.5")
snapshot_download("darknight054/VoxCPM-1.5B-HI-LORA", local_dir="./pretrained/VoxCPM-1.5B-HI-LORA")
Once done, you will find lora_config.json in the LoRA checkpoint directory. It has a key called base_model; set it to the absolute path of the directory where you downloaded the base model.
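The base_model edit can also be done programmatically. A minimal sketch, assuming lora_config.json is a flat JSON file with a base_model key (the helper name and the demo on a throwaway file are mine):

```python
# Point lora_config.json's base_model key at the local base-model directory.
import json
import tempfile
from pathlib import Path

def set_base_model(cfg_path, base_dir):
    """Rewrite the base_model key of a lora_config.json in place."""
    cfg_path = Path(cfg_path)
    cfg = json.loads(cfg_path.read_text())
    cfg["base_model"] = str(Path(base_dir).resolve())  # absolute path
    cfg_path.write_text(json.dumps(cfg, indent=2))
    return cfg["base_model"]

# Demo on a throwaway config; for real use, point it at
# ./pretrained/VoxCPM-1.5B-HI-LORA/lora_config.json and ./pretrained/VoxCPM-1.5
with tempfile.TemporaryDirectory() as tmp:
    p = Path(tmp) / "lora_config.json"
    p.write_text(json.dumps({"base_model": ""}))
    print(set_base_model(p, tmp))
```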
Inference with the checkpoint:
python scripts/test_voxcpm_lora_infer.py \
--ckpt_dir ./pretrained/VoxCPM-1.5B-HI-LORA \
--text "किताबों के अलावा ऐसे कई पत्रिका ब्लॉग या समाचार पत्र हैं जिसे हम पढ़ते हैं" \
--prompt_audio /path/to/reference.wav \
--prompt_text "Reference audio transcript (Hindi)" \
--output cloned_output.wav
⚠️ Disclaimer & Warnings for Use (TTS)
This Text-to-Speech (TTS) model is provided solely for research, testing, and technological development purposes. Any audio content generated by the model does not represent the voice, identity, views, or consent of any real individual or organization. The authors and related parties are not liable for any misuse, illegal activities, infringement of privacy, personal rights, intellectual property rights, or direct or indirect damages arising from the use of this model.
Users have full rights and legal responsibility for the deployment, distribution, and use of the model. The use of the model for impersonation, unauthorized copying of personal voices, creating misleading content, fraud, manipulation of public opinion, or any purpose contrary to current laws is strictly prohibited. When using or sharing generated audio, it is recommended that the content be clearly disclosed as AI-generated audio and that all relevant legal regulations, platform policies, and ethical standards be fully complied with.
Acknowledgements
@inproceedings{ai4bharat2024indicvoices_r,
author = {Ashwin Sankar and
Srija Anand and
Praveen Srinivasa Varadhan and
Sherry Thomas and
Mehak Singal and
Shridhar Kumar and
Deovrat Mehendale and
Aditi Krishana and
Giri Raju and
Mitesh M. Khapra},
editor = {Amir Globersons and
Lester Mackey and
Danielle Belgrave and
Angela Fan and
Ulrich Paquet and
Jakub M. Tomczak and
Cheng Zhang},
title = {IndicVoices-R: Unlocking a Massive Multilingual Multi-speaker Speech
Corpus for Scaling Indian {TTS}},
booktitle = {Advances in Neural Information Processing Systems 38: Annual Conference
on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver,
BC, Canada, December 10 - 15, 2024},
year = {2024},
url = {http://papers.nips.cc/paper\_files/paper/2024/hash/7dfcaf4512bbf2a807a783b90afb6c09-Abstract-Datasets\_and\_Benchmarks\_Track.html},
}