VoxCPM-1.5B-HI-LORA

The TTS model was finetuned on the Hindi subset of the ai4bharat/indicvoices_r dataset, using audio clips between 4 and 12 seconds long after resampling to 44.1 kHz. I tried to keep the male/female split at roughly 50/50 (about 32 hours in total). I think more data is required for better accuracy. Voice cloning may feel a bit less accurate on longer generated sequences, which is expected since the training clips are at most 12 s. The checkpoint is from iteration 8000 (I was not seeing much improvement, and since this is just an experiment, I didn't train further).
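The 4–12 s duration filter and the resampling step described above can be sketched as follows. This is a minimal numpy illustration, not the actual preprocessing pipeline; a real pipeline would resample with librosa or torchaudio, which also apply anti-aliasing filtering:

```python
import numpy as np

def keep_clip(num_samples: int, sr: int, min_s: float = 4.0, max_s: float = 12.0) -> bool:
    """Keep only clips between 4 and 12 seconds, matching the filter described above."""
    dur = num_samples / sr
    return min_s <= dur <= max_s

def resample_linear(wave: np.ndarray, sr_in: int, sr_out: int = 44100) -> np.ndarray:
    """Naive linear-interpolation resample to 44.1 kHz (illustrative only)."""
    n_out = int(round(len(wave) * sr_out / sr_in))
    x_old = np.linspace(0.0, 1.0, num=len(wave), endpoint=False)
    x_new = np.linspace(0.0, 1.0, num=n_out, endpoint=False)
    return np.interp(x_new, x_old, wave)
```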

Device: NVIDIA RTX 4070 Super (12 GB)

How-tos

Finetune a LoRA:

Follow the original guide here

In addition to the LoRA layers covered in that guide, we also trained the text embedding layers. Full finetuning should generally give better results.
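As a sketch, this amounts to marking the LoRA adapter weights plus the text-embedding table as trainable and freezing everything else. The parameter names below are illustrative, not VoxCPM's actual module names:

```python
def trainable(name: str) -> bool:
    """Train LoRA adapter weights plus the text-embedding table; freeze the rest.
    The name patterns are hypothetical, not VoxCPM's real parameter names."""
    return "lora_" in name or name.startswith("text_embed")

# Example parameter names from a hypothetical model:
params = ["text_embed.weight", "decoder.lora_A", "decoder.lora_B", "decoder.attn.q_proj.weight"]
selected = [p for p in params if trainable(p)]
```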

Inference:

Setup:

```bash
git clone https://github.com/OpenBMB/VoxCPM.git
cd VoxCPM
# recommended: use a virtual environment
pip install voxcpm
# for voice cloning, I had to install torchcodec
pip install torchcodec==0.9
```

Make sure torchcodec is compatible with your PyTorch and Python versions. You can check compatibility here

Download this checkpoint; I recommend using a few lines of Python:

```python
from huggingface_hub import snapshot_download

# base model
snapshot_download("openbmb/VoxCPM1.5", local_dir="./pretrained/VoxCPM-1.5")
# this LoRA checkpoint
snapshot_download("darknight054/VoxCPM-1.5B-HI-LORA", local_dir="./pretrained/VoxCPM-1.5B-HI-LORA")
```

Once downloaded, you will find lora_config.json in the LoRA checkpoint directory; it has a key called base_model. Set it to the absolute path of the directory where you downloaded the base model.
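One way to apply that edit programmatically (a small sketch; the paths are the ones used above, and the helper name is mine):

```python
import json
import pathlib

def set_base_model(lora_dir: str, base_dir: str) -> dict:
    """Point the base_model key in lora_config.json at the absolute base-model path."""
    cfg_path = pathlib.Path(lora_dir) / "lora_config.json"
    cfg = json.loads(cfg_path.read_text())
    cfg["base_model"] = str(pathlib.Path(base_dir).resolve())
    cfg_path.write_text(json.dumps(cfg, indent=2))
    return cfg

# set_base_model("./pretrained/VoxCPM-1.5B-HI-LORA", "./pretrained/VoxCPM-1.5")
```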

Inference with the checkpoint:

```bash
python scripts/test_voxcpm_lora_infer.py \
    --ckpt_dir ./pretrained/VoxCPM-1.5B-HI-LORA \
    --text "किताबों के अलावा ऐसे कई पत्रिका ब्लॉग या समाचार पत्र हैं जिसे हम पढ़ते हैं" \
    --prompt_audio /path/to/reference.wav \
    --prompt_text "Reference audio transcript (Hindi)" \
    --output cloned_output.wav
```

⚠️ Disclaimer & Warnings for Use (TTS)

This Text-to-Speech (TTS) model is provided solely for research, testing, and technological development purposes. Any audio content generated by the model does not represent the voice, identity, views, or consent of any real individual or organization. The authors and related parties are not liable for any misuse, illegal activities, infringement of privacy, personal rights, or intellectual property rights, or any direct or indirect damages arising from the use of this model.

Users bear full legal responsibility for the deployment, distribution, and use of the model. Using the model for impersonation, unauthorized copying of personal voices, creating misleading content, fraud, manipulation of public opinion, or any purpose contrary to applicable law is strictly prohibited. When using or sharing generated audio, it is recommended that the content be clearly disclosed as AI-generated and that all relevant legal regulations, platform policies, and ethical standards be fully complied with.

Acknowledgements

```bibtex
@inproceedings{ai4bharat2024indicvoices_r,
  author       = {Ashwin Sankar and
                  Srija Anand and
                  Praveen Srinivasa Varadhan and
                  Sherry Thomas and
                  Mehak Singal and
                  Shridhar Kumar and
                  Deovrat Mehendale and
                  Aditi Krishana and
                  Giri Raju and
                  Mitesh M. Khapra},
  editor       = {Amir Globersons and
                  Lester Mackey and
                  Danielle Belgrave and
                  Angela Fan and
                  Ulrich Paquet and
                  Jakub M. Tomczak and
                  Cheng Zhang},
  title        = {IndicVoices-R: Unlocking a Massive Multilingual Multi-speaker Speech
                  Corpus for Scaling Indian {TTS}},
  booktitle    = {Advances in Neural Information Processing Systems 38: Annual Conference
                  on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver,
                  BC, Canada, December 10 - 15, 2024},
  year         = {2024},
  url          = {http://papers.nips.cc/paper_files/paper/2024/hash/7dfcaf4512bbf2a807a783b90afb6c09-Abstract-Datasets_and_Benchmarks_Track.html},
}
```
