# Qwen3-Voice-Embedding-12Hz-0.6B (ONNX)
ONNX exports of the Qwen3-Voice-Embedding-12Hz-0.6B ECAPA-TDNN speaker encoder. Produces 1024-dimensional x-vector speaker embeddings from audio.
Three quantization variants are provided for different deployment targets:
| File | Format | Size | Use case |
|---|---|---|---|
| `speaker_encoder_fp32.onnx` | Float32 | 35 MB | Maximum accuracy |
| `speaker_encoder_fp16.onnx` | Float16 | 18 MB | Browser / GPU inference (recommended) |
| `speaker_encoder_int8.onnx` | Int8 | 9 MB | Edge / mobile / minimal footprint |
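When picking a variant, it is worth measuring how close the quantized embeddings stay to the fp32 ones. A minimal sketch of such a check, assuming the model files have been downloaded locally (the `cosine_similarity` helper itself is general-purpose):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical comparison of the int8 and fp32 variants on the same mel input
# (requires onnxruntime and the downloaded .onnx files):
#
#   import onnxruntime as ort
#   fp32 = ort.InferenceSession("speaker_encoder_fp32.onnx")
#   int8 = ort.InferenceSession("speaker_encoder_int8.onnx")
#   e32 = fp32.run(None, {"mel_spectrogram": mel})[0]
#   e8 = int8.run(None, {"mel_spectrogram": mel})[0]
#   print(cosine_similarity(e32, e8))  # close to 1.0 if quantization is benign

# Sanity check on synthetic vectors:
v = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(v, v))   # ~1.0
print(cosine_similarity(v, -v))  # ~-1.0
```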
```python
import numpy as np
import onnxruntime as ort
import librosa

# Load model
session = ort.InferenceSession("speaker_encoder_fp32.onnx")

# Compute mel spectrogram (must match training preprocessing)
audio, sr = librosa.load("audio.wav", sr=24000, mono=True)
mel = librosa.feature.melspectrogram(
    y=audio, sr=24000, n_fft=1024, hop_length=256,
    n_mels=128, fmin=0, fmax=12000,
)
mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))
mel = mel.T[np.newaxis, ...]  # (1, time, 128)

# Run inference
embedding = session.run(None, {"mel_spectrogram": mel.astype(np.float32)})[0]
# embedding shape: (1, 1024)
```
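A common downstream use of these embeddings is speaker identification against a set of enrolled speakers. A minimal sketch using synthetic vectors in place of real model outputs (the names `identify` and `enrolled` are illustrative, not part of the model's API):

```python
import numpy as np

def identify(query: np.ndarray, enrolled: np.ndarray) -> int:
    """Return the index of the enrolled embedding most similar to `query`.

    `query`: (dim,) embedding; `enrolled`: (n_speakers, dim) matrix.
    Rows are L2-normalized so the dot product equals cosine similarity.
    """
    enrolled = enrolled / np.linalg.norm(enrolled, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    return int(np.argmax(enrolled @ query))

# Synthetic 1024-dim embeddings standing in for real encoder outputs
rng = np.random.default_rng(0)
enrolled = rng.standard_normal((3, 1024))
query = enrolled[1] + 0.1 * rng.standard_normal(1024)  # noisy copy of speaker 1
print(identify(query, enrolled))  # 1
```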
```javascript
import * as ort from "onnxruntime-web";

const session = await ort.InferenceSession.create("speaker_encoder_fp16.onnx");

// mel: Float32Array of shape [1, time_steps, 128]
const tensor = new ort.Tensor("float32", mel, [1, timeSteps, 128]);
const results = await session.run({ mel_spectrogram: tensor });
const embedding = results.speaker_embedding.data; // Float32Array(1024)
```
| | Name | Shape | Type |
|---|---|---|---|
| Input | `mel_spectrogram` | (batch, time, 128) | float32 |
| Output | `speaker_embedding` | (batch, 1024) | float32 |
The time axis is dynamic — any length mel spectrogram is accepted.
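Concretely, with librosa's default centered framing, a clip of `n` samples yields `1 + n // hop_length` mel frames, so the input length is easy to predict. A small sketch (the formula assumes `center=True`, librosa's default):

```python
def n_mel_frames(n_samples: int, hop_length: int = 256) -> int:
    """Frame count produced by librosa.feature.melspectrogram with center=True."""
    return 1 + n_samples // hop_length

# One second of 24 kHz audio -> 94 frames; the model accepts any such length.
for seconds in (1, 3, 10):
    print(seconds, n_mel_frames(seconds * 24000))
```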
| Parameter | Value |
|---|---|
| Sample rate | 24000 Hz |
| FFT size | 1024 |
| Hop length | 256 |
| Mel bins | 128 |
| Frequency range | 0 - 12000 Hz |
| Mel scale | Slaney |
| Compression | log(clamp(x, min=1e-5)) |
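The compression row means silence and near-zero mel energies are floored at `log(1e-5) ≈ -11.51` rather than diverging to negative infinity. A quick numeric check of that step, written with NumPy as in the Python example above:

```python
import numpy as np

def compress(mel: np.ndarray) -> np.ndarray:
    """log(clamp(x, min=1e-5)) from the preprocessing table."""
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))

x = np.array([0.0, 1e-5, 1.0])
print(compress(x))  # zeros are floored to log(1e-5) ~ -11.51; log(1.0) = 0.0
```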
Exported with `torch.onnx.export` (opset 18) from the standalone PyTorch model. Verified against the PyTorch output (cosine similarity > 0.9999). The export script is available at `scripts/export_onnx.py`.
```bibtex
@article{qwen3-tts,
  title={Qwen3-TTS Technical Report},
  author={Hu, Hangrui and Zhu, Xinfa and He, Ting and Guo, Dake and Zhang, Bin and Wang, Xiong and Guo, Zhifang and Jiang, Ziyue and Hao, Hongkun and Guo, Zishan and Zhang, Xinyu and Zhang, Pei and Yang, Baosong and Xu, Jin and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2601.15621},
  year={2026}
}

@article{ecapa-tdnn,
  title={ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification},
  author={Desplanques, Brecht and Thienpondt, Jenthe and Demuynck, Kris},
  journal={Proc. Interspeech},
  year={2020}
}
```
Base model: `Qwen/Qwen3-TTS-12Hz-0.6B-Base`