CSM Maya TTS
This is a LoRA fine-tune of sesame/csm-1b that sounds like the voice in Sesame's demo.
Try it out at TinkerSpace HF Space.
Samples
Inference
Use speaker_id=4 only.
import torch
from peft import PeftModel
from transformers import CsmForConditionalGeneration, AutoProcessor
model_id = "sesame/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)
model = PeftModel.from_pretrained(model, "shb777/csm-maya-exp2")
conversation = [
    {"role": "4", "content": [{"type": "text", "text": "Hey there, I am Maya."}]},
]
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)
gen_kwargs = {
    "max_new_tokens": 375,
    # "do_sample": True,
    # "temperature": 0.7,
    # "depth_decoder_do_sample": True,
    # "depth_decoder_temperature": 0.7,
    # "depth_decoder_top_k": 20,
    # "depth_decoder_top_p": 0.95,
}
audio = model.generate(**inputs, **gen_kwargs, output_audio=True)
processor.save_audio(audio, "example.wav")
Training
Raw data was processed with a custom 5-step pipeline modeled on Emilia-Pipe.
- Parakeet v2 was used for STT
- The VAD chunking algorithm was tweaked for cleaner cuts, with each clip up to 20 s long
- Chunked clips were filtered by UTMOSv2 score with additional filtering to remove clips with artifacts
- About 30% of the collected data was used for training (around 31 hours)
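The duration and quality filtering steps above can be sketched roughly as follows. This is only an illustration: the `Clip` record, the function names, and the `MIN_UTMOS` cutoff are hypothetical (the actual threshold and the artifact filter are not stated), and the scores are assumed to have been computed beforehand by a quality model such as UTMOSv2.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    path: str
    duration_s: float  # clip length in seconds
    utmos: float       # precomputed quality score (e.g. from UTMOSv2)


MAX_DURATION_S = 20.0  # from the text: each clip is up to 20 s
MIN_UTMOS = 3.5        # hypothetical cutoff; the real threshold is not stated


def filter_clips(clips, min_utmos=MIN_UTMOS, max_duration_s=MAX_DURATION_S):
    """Keep clips that are short enough and score at or above the cutoff."""
    return [
        c for c in clips
        if c.duration_s <= max_duration_s and c.utmos >= min_utmos
    ]


def total_hours(clips):
    """Total audio duration of the given clips, in hours."""
    return sum(c.duration_s for c in clips) / 3600.0


# Made-up example records, only to exercise the filter.
clips = [
    Clip("a.wav", 12.0, 4.1),
    Clip("b.wav", 19.5, 2.8),  # dropped: score below cutoff
    Clip("c.wav", 25.0, 4.4),  # dropped: longer than 20 s
    Clip("d.wav", 6.0, 3.5),
]
kept = filter_clips(clips)
print([c.path for c in kept], f"{total_hours(kept):.4f} h")
```

Tracking the retained hours this way makes it easy to see how a given cutoff trades dataset size against quality.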
I have another chunking algorithm that uses stable-whisper for forced alignment and produces a better mixture of small and large (up to 30 s) clips, but it is too slow to run locally. I will leave that, along with the full data, to a future training run on the cloud.
Some observations:
- Inconsistent voice with the same speaker ID (this is expected as it's a base model)
- Noise at the end of generated audio (reduces with finetuning, especially with longer clips)
- Speaker ID 40 onwards seems bad
- Struggles with ( and ) and " and " and ; and ?! and [ and ] and / (also seen in the official demo; I guess this is due to the nature of Sesame's preprocessing)
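One possible workaround for the problematic characters noted in the observations above is to normalize the text before it reaches the processor. The sketch below is an assumption for illustration, not part of the author's pipeline: the function name `normalize_for_tts` and the choice of replacements (brackets and slashes become spaces, quotes are dropped, `;` becomes `,`, `?!` becomes `?`) are hypothetical.

```python
import re

# Characters reported as problematic, mapped to a stand-in.
# The mapping itself is an illustrative assumption.
_REPLACEMENTS = {
    "(": " ", ")": " ",
    "[": " ", "]": " ",
    "/": " ",
    '"': "", "\u201c": "", "\u201d": "",  # straight and curly double quotes
    ";": ",",
}


def normalize_for_tts(text: str) -> str:
    """Replace characters the model struggles with and tidy the whitespace."""
    text = text.replace("?!", "?")
    for src, dst in _REPLACEMENTS.items():
        text = text.replace(src, dst)
    text = re.sub(r"\s+", " ", text)            # collapse runs of whitespace
    text = re.sub(r"\s+([,.?!])", r"\1", text)  # no space before punctuation
    return text.strip()


print(normalize_for_tts('She said "wait; really?!" (twice) [loudly]'))
```

The normalized string can then be used as the "text" value in the conversation passed to `apply_chat_template`.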
I ran several ablations with about 4 hours of data to find the best parameters and understand more about the model.
- Framework: Unsloth (SFT)
- LoRA Target: attn + mlp in backbone and decoder, excluding codec
- LoRA Rank: 16
- LoRA Alpha: 32
- Learning Rate: 1e-4, 0.1 warmup with cosine scheduler
- Optimizer: adamw_torch_fused
- Epochs: 4
- Batch Size: 8
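To make the learning-rate setting concrete, here is a minimal sketch of a cosine schedule with a 0.1 warmup ratio and a 1e-4 peak. It assumes linear warmup from zero and decay to zero, which is how cosine-with-warmup schedules are commonly implemented; the exact scheduler behavior in the training run is not stated. The `total_steps` value is hypothetical, and the last line assumes the standard LoRA scaling of alpha / rank.

```python
import math

PEAK_LR = 1e-4      # from the hyperparameter list
WARMUP_RATIO = 0.1  # from the hyperparameter list
LORA_RANK = 16
LORA_ALPHA = 32


def lr_at(step, total_steps, peak_lr=PEAK_LR, warmup_ratio=WARMUP_RATIO):
    """Learning rate at `step`: linear warmup, then cosine decay to zero."""
    warmup_steps = round(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical; depends on dataset size, batch size, epochs
for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at(step, total_steps):.2e}")

# With standard LoRA scaling, the adapter update is multiplied by alpha / rank.
print("LoRA scaling:", LORA_ALPHA / LORA_RANK)
```

With these values the rate climbs to 1e-4 over the first 10% of steps, passes through half the peak midway through the decay phase, and reaches zero at the final step.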
Limitations
- The real strength of the model (and the reason it was designed) is multi-turn conversation with audio context. Since most of the training data was single-turn, this fine-tune may not generalize as well as one trained on full-duplex conversational data.
- The model struggles with certain characters.
- There might be some noise at the end of some generated clips.
Acknowledgements
- Base TTS for the example sentences
- Blog Post by Thomas Wolf and Sesame's own blog
License
This is meant for research and personal use only. The license is due to the source of the training data.