Instructions to use ResembleAI/chatterbox with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Chatterbox
How to use ResembleAI/chatterbox with Chatterbox:
```python
# pip install chatterbox-tts
import torchaudio as ta
from chatterbox.tts import ChatterboxTTS

model = ChatterboxTTS.from_pretrained(device="cuda")

text = "Ezreal and Jinx teamed up with Ahri, Yasuo, and Teemo to take down the enemy's Nexus in an epic late-game pentakill."
wav = model.generate(text)
ta.save("test-1.wav", wav, model.sr)

# If you want to synthesize with a different voice, specify the audio prompt
AUDIO_PROMPT_PATH = "YOUR_FILE.wav"
wav = model.generate(text, audio_prompt_path=AUDIO_PROMPT_PATH)
ta.save("test-2.wav", wav, model.sr)
```
- Inference
- Notebooks
- Google Colab
- Kaggle
Is there a way to encode voices directly?
Can we hardcode voices in the model at initialization to make inference faster? It currently takes around 2 seconds to clone a sentence.
I was thinking exactly the same. Gemini found a solution for me in a single shot by editing tts.py.
But I usually don't post AI-generated code on the internet, and I can't really be bothered to check it thoroughly.
What I asked was: "Problem: prepare_conditionals is called every time generate is run with an audio_prompt_path. This involves librosa loading/resampling and model inference (VE, S3Gen tokenizer). I frequently use the same audio_prompt_path, and I'd like to cache the computed Conditionals object. Can you help me with that? Here is the code:"
It came up with a solution where, if I "warm up" the model with a reference audio file once, subsequent requests with the same voice are faster.
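The caching idea behind that fix can be sketched without touching tts.py: memoize the expensive prompt-preprocessing step, keyed by the audio file path. In the sketch below, `prepare_prompt` is a hypothetical stand-in for Chatterbox's real `prepare_conditionals` (which does the librosa load/resample and voice-encoder inference mentioned above); the caching pattern is the same regardless of what the expensive function does.

```python
from functools import lru_cache

# Hypothetical stand-in for the expensive step. In Chatterbox this would be
# prepare_conditionals: librosa loading/resampling plus VE / S3Gen tokenizer
# inference, producing a Conditionals object for the reference voice.
def prepare_prompt(audio_prompt_path: str) -> dict:
    print(f"computing conditionals for {audio_prompt_path}")
    return {"path": audio_prompt_path}  # pretend this is a Conditionals object

# Cache keyed by the path string, so repeated requests with the same
# reference audio skip the recomputation entirely after the first call.
@lru_cache(maxsize=8)
def cached_prompt(audio_prompt_path: str) -> dict:
    return prepare_prompt(audio_prompt_path)

conds_a = cached_prompt("voice_a.wav")  # first call: computed (prints)
conds_b = cached_prompt("voice_a.wav")  # second call: served from the cache
```

Note the trade-off: keying on the path means a stale cache if the file's contents change, and the "warm-up" behaviour described above is just this pattern in disguise — the first request with a given reference audio pays the preprocessing cost, and later requests reuse the stored conditionals.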