🥔 MiraToffel (German MiraTTS)
MiraToffel is a German fine-tune of the MiraTTS model. It generates clear, realistic speech at up to 100x realtime (with batching via LMDeploy) while maintaining high-fidelity 48 kHz audio output.
Designed to be memory efficient, it runs comfortably within 6 GB of VRAM with latencies as low as 100 ms.
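For reference, an "x realtime" figure is simply the seconds of audio produced per second of wall-clock compute. A tiny helper to check the claim against your own timings (this helper is illustrative, not part of the Mira API):

```python
def realtime_factor(audio_seconds: float, wall_seconds: float) -> float:
    """Seconds of audio produced per second of wall-clock time."""
    return audio_seconds / wall_seconds

# 10 s of speech generated in 0.1 s of compute is 100x realtime.
rtf = realtime_factor(10.0, 0.1)
# → 100.0
```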
Key Benefits
- ⚡ Incredibly Fast: capable of speeds over 100x realtime.
- 🎧 High Quality: generates crisp 48 kHz audio, surpassing standard 24 kHz/32 kHz models.
- 💾 Memory Efficient: fully functional on consumer GPUs with <6 GB VRAM.
- 🗣️ Voice Cloning: zero-shot voice cloning with high resemblance to the reference audio.
Which Kartoffel-Model should I use?
Since there are multiple versions of the "Kartoffel" fine-tunes, here is a quick guide to help you choose the right one for your needs:
| Model | Best Used For | Pros | Cons |
|---|---|---|---|
| Kartoffelbox | Voice cloning | Superior voice-cloning accuracy; good emotion control | Lower stability and worse pronunciation than MiraToffel |
| MiraToffel | High-fidelity speech | Better pronunciation and stability; fast generation | Voice cloning is less accurate than Kartoffelbox |
| Kartoffelbox-Turbo | Experimentation | Faster than base Kartoffelbox | Experimental status; unstable |
Summary:
- Choose Kartoffelbox if your priority is sounding close to the reference speaker.
- Choose MiraToffel if your priority is clear audio quality and better German pronunciation, even if the voice match isn't 100% perfect.
Installation
You can install the optimized Mira library directly from GitHub:
```bash
uv pip install git+https://github.com/ysharma3501/MiraTTS.git
# Or using standard pip
pip install git+https://github.com/ysharma3501/MiraTTS.git
```
Usage
Standard Usage with LMDeploy
```python
from mira.model import MiraTTS
from IPython.display import Audio

# Load the German fine-tune
mira_tts = MiraTTS('SebastianBodza/MiraToffel_miraTTS_german')

reference_file = "german_reference.wav"
text = "Na, hast du schon mal von MiraToffel gehört? Das ist echt der Wahnsinn!"

# 1. Encode the reference voice
context_tokens = mira_tts.encode_audio(reference_file)

# 2. Generate speech
audio = mira_tts.generate(text, context_tokens)

# 3. Play/save (audio is 48 kHz)
Audio(audio, rate=48000)
```
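If you want a file on disk rather than inline notebook playback, the 48 kHz output can be written as a WAV. A minimal standard-library sketch, with a generated sine wave standing in for the model's `audio` output (in practice you would use the array returned by `mira_tts.generate`):

```python
import math
import struct
import wave

SAMPLE_RATE = 48_000  # MiraToffel outputs 48 kHz audio

# Stand-in for the model output: 0.5 s of a 440 Hz sine in [-1.0, 1.0].
audio = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE) for t in range(SAMPLE_RATE // 2)]

# Convert float samples to 16-bit PCM and write a mono WAV file.
with wave.open("miratoffel_output.wav", "wb") as wf:
    wf.setnchannels(1)
    wf.setsampwidth(2)  # 16-bit samples
    wf.setframerate(SAMPLE_RATE)
    pcm = b"".join(struct.pack("<h", int(max(-1.0, min(1.0, s)) * 32767)) for s in audio)
    wf.writeframes(pcm)
```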
Batching Example:
```python
text_batch = [
    "Hallo! Wie geht es dir heute?",
    "Ich finde diese Technologie faszinierend."
]
context_tokens = [mira_tts.encode_audio(reference_file)]

# Generate multiple sentences at once
audio_batch = mira_tts.batch_generate(text_batch, context_tokens)
```
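Batching works best when longer text is first split into sentence-sized chunks. A naive splitter in plain Python (a production pipeline might prefer a German-aware sentence tokenizer):

```python
import re

def split_sentences(text: str) -> list[str]:
    # Split after ., ! or ? followed by whitespace; drop empty pieces.
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

paragraph = "Hallo! Wie geht es dir heute? Ich finde diese Technologie faszinierend."
text_batch = split_sentences(paragraph)
# → ['Hallo!', 'Wie geht es dir heute?', 'Ich finde diese Technologie faszinierend.']
```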
Usage with transformers
If you need granular control or want to integrate into an existing transformers pipeline without the Mira wrapper, use the following script.
```python
import logging

import librosa
import numpy as np
import soundfile as sf
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
REPO_ID = "SebastianBodza/MiraToffel_miraTTS_german"

# NOTE: `tts_codec` is assumed to be the MiraTTS audio codec (the component
# that encodes reference audio into context tokens and decodes generated
# tokens back into a waveform). Initialize it as described in the MiraTTS
# repository before running this script.

def _init_model(repo_id):
    """Initialize the model using standard Transformers."""
    logging.info(f"🚀 Initializing Transformers model from {repo_id}...")
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    model = AutoModelForCausalLM.from_pretrained(repo_id)
    model.to(DEVICE)
    return {"model": model, "tokenizer": tokenizer}

def generate_mira_speech(model_obj, text, audio_path):
    model = model_obj["model"]
    tokenizer = model_obj["tokenizer"]

    # 1. Load and encode the reference audio (resampled to 16 kHz)
    audio_array, sr = librosa.load(audio_path, sr=16000)
    context_tokens = tts_codec.encode(audio_array)

    # 2. Format the prompt
    formatted_prompt = tts_codec.format_prompt(text, context_tokens, None)

    # 3. Tokenize inputs
    model_inputs = tokenizer([formatted_prompt], return_tensors="pt").to(DEVICE)

    # 4. Generate
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=1024,
        do_sample=True,
        temperature=0.8,
        top_k=50,
        top_p=1.0,
        repetition_penalty=1.2,
        eos_token_id=tokenizer.eos_token_id,
        pad_token_id=tokenizer.pad_token_id,
    )

    # 5. Strip the prompt tokens and decode the generated ids to token text
    generated_ids_trimmed = generated_ids[:, model_inputs.input_ids.shape[1]:]
    predicts_text = tokenizer.batch_decode(generated_ids_trimmed, skip_special_tokens=False)[0]

    # 6. Convert the codec tokens back to a 48 kHz waveform
    audio = tts_codec.decode(predicts_text, context_tokens)
    if isinstance(audio, torch.Tensor):
        audio = audio.detach().cpu().float().numpy()
    elif isinstance(audio, np.ndarray):
        audio = audio.astype(np.float32)
    return audio, 48000

# --- Execution ---
print("Loading MiraToffel...")
model_data = _init_model(REPO_ID)

text_input = "Hallo, das ist ein Test des MiraToffel Systems."
reference_audio = "path/to/german_reference.wav"

print("Generating...")
audio_out, sample_rate = generate_mira_speech(model_data, text_input, reference_audio)
sf.write("miratoffel_output.wav", audio_out, sample_rate)
print("Saved to miratoffel_output.wav")
```
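The sampling parameters passed to `model.generate` above (temperature, top-k) control how the next audio token is drawn. A self-contained illustration of temperature scaling plus top-k filtering on a toy logit vector (the same math Transformers applies internally, not its actual implementation):

```python
import math

def top_k_probs(logits: list[float], k: int, temperature: float = 0.8) -> list[float]:
    # Scale logits by 1/temperature, keep only the k largest, softmax the rest.
    scaled = [l / temperature for l in logits]
    cutoff = sorted(scaled, reverse=True)[k - 1]
    exps = [math.exp(s) if s >= cutoff else 0.0 for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

probs = top_k_probs([2.0, 1.0, 0.5, -1.0], k=2)
# Only the two largest logits keep nonzero probability; a lower temperature
# would sharpen the distribution toward the top token even further.
```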
Acknowledgements
- Thanks to YatharthS for training MiraTTS and for the fine-tuning code.
- Check out the GitHub Repository for finetuning notebooks and further details.
Model tree for SebastianBodza/MiraToffel_miraTTS_german
- Base model: YatharthS/MiraTTS