🥔 MiraToffel (German MiraTTS)
MiraToffel is a German fine-tune of the MiraTTS model. It generates clear, realistic speech at up to 100x realtime (with batching via LMDeploy) while maintaining high-fidelity 48 kHz audio output.
Designed to be memory efficient, it runs comfortably within 6 GB of VRAM with latencies as low as 100 ms.
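For reference, an "x realtime" figure is simply the seconds of audio produced per second of wall-clock compute. A tiny helper to check the claim against your own timings (this helper is illustrative, not part of the Mira API):

```python
def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio produced per second of wall-clock time."""
    return audio_seconds / wall_seconds

# 10 s of speech generated in 0.1 s of compute is 100x realtime.
rtf = realtime_factor(10.0, 0.1)
# → 100.0
```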
Key Benefits
- ⚡ Incredibly Fast: capable of speeds over 100x realtime.
- 🎧 High Quality: generates crisp 48 kHz audio, surpassing standard 24 kHz/32 kHz models.
- 💾 Memory Efficient: fully functional on consumer GPUs with <6 GB VRAM.
- 🗣️ Voice Cloning: zero-shot voice cloning with high resemblance to the reference audio.
Which Kartoffel-Model should I use?
Since there are multiple versions of the "Kartoffel" fine-tunes, here is a quick guide to help you choose the right one for your needs:
| Model | Best Used For | Pros | Cons |
|---|---|---|---|
| Kartoffelbox | Voice cloning | Superior voice-cloning accuracy; good emotion control | Lower stability and worse pronunciation than MiraToffel |
| MiraToffel | High-fidelity speech | Better pronunciation and stability; fast generation | Voice cloning is less accurate than Kartoffelbox |
| Kartoffelbox-Turbo | Experimentation | Faster than base Kartoffelbox | Experimental status; unstable |
Summary:
- Choose Kartoffelbox if your priority is sounding close to the reference speaker.
- Choose MiraToffel if your priority is clear audio quality and better German pronunciation, even if the voice match isn't 100% perfect.
Installation
You can install the optimized Mira library directly from GitHub:
```bash
uv pip install git+https://github.com/ysharma3501/MiraTTS.git
# Or using standard pip
pip install git+https://github.com/ysharma3501/MiraTTS.git
```
Usage
Standard Usage with LMDeploy
```python
from mira.model import MiraTTS
from IPython.display import Audio

# Load the German fine-tune
mira_tts = MiraTTS('SebastianBodza/MiraToffel_miraTTS_german')

reference_file = "german_reference.wav"
text = "Na, hast du schon mal von MiraToffel gehört? Das ist echt der Wahnsinn!"

# 1. Encode the reference voice
context_tokens = mira_tts.encode_audio(reference_file)

# 2. Generate speech
audio = mira_tts.generate(text, context_tokens)

# 3. Play/save (audio is 48 kHz)
Audio(audio, rate=48000)
```
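If you want a file on disk rather than inline notebook playback, the 48 kHz output can be written as a WAV. A minimal standard-library sketch, with a generated sine wave standing in for the model's `audio` output (in practice you would use the array returned by `mira_tts.generate`):

```python
import math
import struct
import wave

SAMPLE_RATE = 48_000  # MiraToffel outputs 48 kHz audio

# Stand-in for the model output: 0.5 s of a 440 Hz sine in [-1.0, 1.0].
audio = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE // 2)]

# Convert float samples to 16-bit PCM and write a mono WAV file.
with wave.open("miratoffel_output.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)  # 16-bit samples
    wf.setframerate(SAMPLE_RATE)
    pcm = b"".join(struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in audio)
    wf.writeframes(pcm)
```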
Batching Example:
```python
text_batch = [
    "Hallo! Wie geht es dir heute?",
    "Ich finde diese Technologie faszinierend."
]
context_tokens = [mira_tts.encode_audio(reference_file)]

# Generate multiple sentences at once
audio_batch = mira_tts.batch_generate(text_batch, context_tokens)
```
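Batching works best when longer text is first split into sentence-sized chunks. A naive splitter in plain Python (a production pipeline might prefer a German-aware sentence tokenizer):

```python
import re

def split_sentences(text: str) -> list[str]:
    # Split after ., ! or ? followed by whitespace; drop empty pieces.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

paragraph = "Hallo! Wie geht es dir heute? Ich finde diese Technologie faszinierend."
text_batch = split_sentences(paragraph)
# → ['Hallo!', 'Wie geht es dir heute?', 'Ich finde diese Technologie faszinierend.']
```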
Usage with transformers
If you need granular control or want to integrate into an existing transformers pipeline without the Mira wrapper, use the following script.
```python
import logging

import librosa
import numpy as np
import soundfile as sf
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
REPO_ID = "SebastianBodza/MiraToffel_miraTTS_german"

# NOTE: `tts_codec` is assumed to be the MiraTTS audio codec (the component
# that encodes reference audio into context tokens and decodes generated
# tokens back into a waveform). Initialize it as described in the MiraTTS
# repository before running this script.

def _init_model(repo_id):
    """Initialize the model using standard Transformers."""
    logging.info(f"🚀 Initializing Transformers model from {repo_id}...")
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    model.to(DEVICE)
    return {"model": model, "tokenizer": tokenizer}

def generate_mira_speech(model_obj, text, audio_path):
    model = model_obj["model"]
    tokenizer = model_obj["tokenizer"]

    # 1. Load and encode the reference audio (resampled to 16 kHz)
    audio_array, sr = librosa.load(audio_path, sr=16000)
    context_tokens = tts_codec.encode(audio_array)

    # 2. Format the prompt
    formatted_prompt = tts_codec.format_prompt(text, context_tokens, None)

    # 3. Tokenize inputs
    model_inputs = tokenizer([formatted_prompt], return_tensors="pt").to(DEVICE)

    # 4. Generate
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=1024,
        do_sample=True,
        temperature=0.8,
        top_k=50,
        top_p=1.0,
        repetition_penalty=1.2,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

    # 5. Strip the prompt tokens and decode the generated ids to token text
    generated_ids_trimmed = generated_ids[:, model_inputs.input_ids.shape[1]:]
    predicts_text = tokenizer.batch_decode(generated_ids_trimmed, skip_special_tokens=False)[0]

    # 6. Convert the codec tokens back to a 48 kHz waveform
    audio = tts_codec.decode(predicts_text, context_tokens)
    if isinstance(audio, torch.Tensor):
        audio = audio.detach().cpu().float().numpy()
    elif isinstance(audio, np.ndarray):
        audio = audio.astype(np.float32)
    return audio, 48000

# --- Execution ---
print("Loading MiraToffel...")
model_data = _init_model(REPO_ID)

text_input = "Hallo, das ist ein Test des MiraToffel Systems."
reference_audio = "path/to/german_reference.wav"

print("Generating...")
audio_out, sample_rate = generate_mira_speech(model_data, text_input, reference_audio)
sf.write("miratoffel_output.wav", audio_out, sample_rate)
print("Saved to miratoffel_output.wav")
```
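The sampling parameters passed to `model.generate` above (temperature, top-k) control how the next audio token is drawn. A self-contained illustration of temperature scaling plus top-k filtering on a toy logit vector (the same math Transformers applies internally, not its actual implementation):

```python
import math

def top_k_probs(logits: list[float], k: int, temperature: float = 0.8) -> list[float]:
    # Scale logits by 1/temperature, keep only the k largest, softmax the rest.
    scaled = [l / temperature for l in logits]
    cutoff = sorted(scaled, reverse=True)[k - 1]
    exps = [math.exp(s) if s >= cutoff else 0.0 for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs = top_k_probs([2.0, 1.0, 0.5, -1.0], k=2)
# Only the two largest logits keep nonzero probability; a lower temperature
# would sharpen the distribution toward the top token even further.
```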
Acknowledgements
- Thanks to YatharthS for training MiraTTS and for the fine-tuning code.
- Check out the GitHub Repository for finetuning notebooks and further details.
Model tree for SebastianBodza/MiraToffel_miraTTS_german
- Base model: YatharthS/MiraTTS