Instructions to use ResembleAI/chatterbox with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Chatterbox
How to use ResembleAI/chatterbox with Chatterbox:
```python
# pip install chatterbox-tts
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

text = "Ezreal and Jinx teamed up with Ahri, Yasuo, and Teemo to take down the enemy's Nexus in an epic late-game pentakill."
wav = model.generate(text)
ta.save("test-1.wav", wav, model.sr)

# If you want to synthesize with a different voice, specify the audio prompt
AUDIO_PROMPT_PATH = "YOUR_FILE.wav"
wav = model.generate(text, audio_prompt_path=AUDIO_PROMPT_PATH)
ta.save("test-2.wav", wav, model.sr)
```
- Inference
- Notebooks
- Google Colab
- Kaggle
Is there a way to encode voices directly?
Can we hardcode voices in the model at initialization to make inference faster? It currently takes around 2 seconds to clone a sentence.
I was thinking exactly the same. Gemini found a solution for me in a single shot by editing tts.py.
But I usually don't post AI-generated code on the internet, and I can't really be bothered to check it thoroughly.
What I asked was: "Problem: prepare_conditionals is called every time generate is run with an audio_prompt_path. This involves librosa loading/resampling and model inference (VE, S3Gen tokenizer). I frequently use the same audio_prompt_path, and I'd like to cache the computed Conditionals object. Can you help me with that? Here is the code:"
It came up with a solution where, if I "warm up" the model with a reference audio file once, subsequent requests with the same voice are faster.
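The caching idea behind that fix can be sketched without touching tts.py: memoize the expensive prompt-preprocessing step, keyed by the audio file path. In the sketch below, `prepare_prompt` is a hypothetical stand-in for Chatterbox's real `prepare_conditionals` (which does the librosa load/resample and voice-encoder inference mentioned above); the caching pattern is the same regardless of what the expensive function does.

```python
from functools import lru_cache

# Hypothetical stand-in for the expensive step. In Chatterbox this would be
# prepare_conditionals: librosa loading/resampling plus VE / S3Gen tokenizer
# inference, producing a Conditionals object for the reference voice.
def prepare_prompt(audio_prompt_path: str) -> dict:
    print(f"computing conditionals for {audio_prompt_path}")
    return {"path": audio_prompt_path}  # pretend this is a Conditionals object

# Cache keyed by the path string, so repeated requests with the same
# reference audio skip the recomputation entirely after the first call.
@lru_cache(maxsize=8)
def cached_prompt(audio_prompt_path: str) -> dict:
    return prepare_prompt(audio_prompt_path)

conds_a = cached_prompt("voice_a.wav")  # first call: computed (prints)
conds_b = cached_prompt("voice_a.wav")  # second call: served from the cache
```

Note the trade-off: keying on the path means a stale cache if the file's contents change, and the "warm-up" behaviour described above is just this pattern in disguise — the first request with a given reference audio pays the preprocessing cost, and later requests reuse the stored conditionals.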