# Qwen3-Voice-Embedding-12Hz-0.6B (ONNX)

ONNX exports of the Qwen3-Voice-Embedding-12Hz-0.6B ECAPA-TDNN speaker encoder. Produces 1024-dimensional x-vector speaker embeddings from audio.

Three quantization variants are provided for different deployment targets:

| File | Format | Size | Use case |
|------|--------|------|----------|
| `speaker_encoder_fp32.onnx` | Float32 | 35 MB | Maximum accuracy |
| `speaker_encoder_fp16.onnx` | Float16 | 18 MB | Browser / GPU inference (recommended) |
| `speaker_encoder_int8.onnx` | Int8 | 9 MB | Edge / mobile / minimal footprint |

## Usage

### Python (ONNX Runtime)

```python
import numpy as np
import onnxruntime as ort
import librosa

# Load model
session = ort.InferenceSession("speaker_encoder_fp32.onnx")

# Compute mel spectrogram (must match training preprocessing)
audio, sr = librosa.load("audio.wav", sr=24000, mono=True)
mel = librosa.feature.melspectrogram(
    y=audio, sr=24000, n_fft=1024, hop_length=256,
    n_mels=128, fmin=0, fmax=12000,
)
mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))
mel = mel.T[np.newaxis, ...]  # (1, time, 128)

# Run inference
embedding = session.run(None, {"mel_spectrogram": mel.astype(np.float32)})[0]
# embedding shape: (1, 1024)
```
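Speaker embeddings of this kind are typically compared with cosine similarity for verification or retrieval. A minimal sketch (the helper name and the example threshold are illustrative, not part of this model card):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two (1, 1024) speaker embeddings."""
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Same speaker -> similarity close to 1.0; different speakers -> lower.
# A decision threshold (e.g. ~0.7) should be tuned on held-out data.
```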

### Browser (ONNX Runtime Web)

```javascript
import * as ort from "onnxruntime-web";

const session = await ort.InferenceSession.create("speaker_encoder_fp16.onnx");

// mel: Float32Array of shape [1, time_steps, 128]
const tensor = new ort.Tensor("float32", mel, [1, timeSteps, 128]);
const results = await session.run({ mel_spectrogram: tensor });
const embedding = results.speaker_embedding.data; // Float32Array(1024)
```

## Input / Output

| | Name | Shape | Type |
|---|------|-------|------|
| Input | `mel_spectrogram` | (batch, time, 128) | float32 |
| Output | `speaker_embedding` | (batch, 1024) | float32 |

The time axis is dynamic; mel spectrograms of any length are accepted.
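With librosa's default center-padded STFT, the number of time steps is determined by the audio length as 1 + ⌊samples / hop_length⌋. A small sketch (the helper name is illustrative):

```python
def num_mel_frames(num_samples: int, hop_length: int = 256) -> int:
    """Frame count for a center-padded STFT (librosa's default)."""
    return 1 + num_samples // hop_length

# One second of 24 kHz audio yields 1 + 24000 // 256 = 94 frames,
# so the model input would have shape (1, 94, 128).
```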

## Audio Preprocessing

| Parameter | Value |
|-----------|-------|
| Sample rate | 24000 Hz |
| FFT size | 1024 |
| Hop length | 256 |
| Mel bins | 128 |
| Frequency range | 0 - 12000 Hz |
| Mel scale | Slaney |
| Compression | `log(clamp(x, min=1e-5))` |

## Export Details

Exported with `torch.onnx.export` (opset 18) from the standalone PyTorch model. Verified against the PyTorch output (cosine similarity > 0.9999).

The export script is available at `scripts/export_onnx.py`.


## Citation

```bibtex
@article{qwen3-tts,
  title={Qwen3-TTS Technical Report},
  author={Hu, Hangrui and Zhu, Xinfa and He, Ting and Guo, Dake and Zhang, Bin and Wang, Xiong and Guo, Zhifang and Jiang, Ziyue and Hao, Hongkun and Guo, Zishan and Zhang, Xinyu and Zhang, Pei and Yang, Baosong and Xu, Jin and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2601.15621},
  year={2026}
}

@article{ecapa-tdnn,
  title={ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification},
  author={Desplanques, Brecht and Thienpondt, Jenthe and Demuynck, Kris},
  journal={Proc. Interspeech},
  year={2020}
}
```