# Qwen3-Voice-Embedding-12Hz-0.6B (ONNX)
ONNX exports of the Qwen3-Voice-Embedding-12Hz-0.6B ECAPA-TDNN speaker encoder. Produces 1024-dimensional x-vector speaker embeddings from audio.
Three quantization variants are provided for different deployment targets:
| File | Format | Size | Use case |
|---|---|---|---|
| `speaker_encoder_fp32.onnx` | Float32 | 35 MB | Maximum accuracy |
| `speaker_encoder_fp16.onnx` | Float16 | 18 MB | Browser / GPU inference (recommended) |
| `speaker_encoder_int8.onnx` | Int8 | 9 MB | Edge / mobile / minimal footprint |
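When picking a variant, it is worth measuring how close the quantized embeddings stay to the fp32 ones. A minimal sketch of such a check, assuming the model files have been downloaded locally (the `cosine_similarity` helper itself is general-purpose):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two 1-D embedding vectors."""
    a, b = a.ravel(), b.ravel()
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical comparison of the int8 and fp32 variants on the same mel input
# (requires onnxruntime and the downloaded .onnx files):
#
#   import onnxruntime as ort
#   fp32 = ort.InferenceSession("speaker_encoder_fp32.onnx")
#   int8 = ort.InferenceSession("speaker_encoder_int8.onnx")
#   e32 = fp32.run(None, {"mel_spectrogram": mel})[0]
#   e8 = int8.run(None, {"mel_spectrogram": mel})[0]
#   print(cosine_similarity(e32, e8))  # close to 1.0 if quantization is benign

# Sanity check on synthetic vectors:
v = np.array([1.0, 2.0, 3.0])
print(cosine_similarity(v, v))   # ~1.0
print(cosine_similarity(v, -v))  # ~-1.0
```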
```python
import numpy as np
import onnxruntime as ort
import librosa

# Load model
session = ort.InferenceSession("speaker_encoder_fp32.onnx")

# Compute mel spectrogram (must match training preprocessing)
audio, sr = librosa.load("audio.wav", sr=24000, mono=True)
mel = librosa.feature.melspectrogram(
    y=audio, sr=24000, n_fft=1024, hop_length=256,
    n_mels=128, fmin=0, fmax=12000,
)
mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))
mel = mel.T[np.newaxis, ...]  # (1, time, 128)

# Run inference
embedding = session.run(None, {"mel_spectrogram": mel.astype(np.float32)})[0]
# embedding shape: (1, 1024)
```
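A common downstream use of these embeddings is speaker identification against a set of enrolled speakers. A minimal sketch using synthetic vectors in place of real model outputs (the names `identify` and `enrolled` are illustrative, not part of the model's API):

```python
import numpy as np

def identify(query: np.ndarray, enrolled: np.ndarray) -> int:
    """Return the index of the enrolled embedding most similar to `query`.

    `query`: (dim,) embedding; `enrolled`: (n_speakers, dim) matrix.
    Rows are L2-normalized so the dot product equals cosine similarity.
    """
    enrolled = enrolled / np.linalg.norm(enrolled, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    return int(np.argmax(enrolled @ query))

# Synthetic 1024-dim embeddings standing in for real encoder outputs
rng = np.random.default_rng(0)
enrolled = rng.standard_normal((3, 1024))
query = enrolled[1] + 0.1 * rng.standard_normal(1024)  # noisy copy of speaker 1
print(identify(query, enrolled))  # 1
```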
```javascript
import * as ort from "onnxruntime-web";

const session = await ort.InferenceSession.create("speaker_encoder_fp16.onnx");

// mel: Float32Array of shape [1, time_steps, 128]
const tensor = new ort.Tensor("float32", mel, [1, timeSteps, 128]);
const results = await session.run({ mel_spectrogram: tensor });
const embedding = results.speaker_embedding.data; // Float32Array(1024)
```
| | Name | Shape | Type |
|---|---|---|---|
| Input | `mel_spectrogram` | (batch, time, 128) | float32 |
| Output | `speaker_embedding` | (batch, 1024) | float32 |
The time axis is dynamic — any length mel spectrogram is accepted.
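Concretely, with librosa's default centered framing, a clip of `n` samples yields `1 + n // hop_length` mel frames, so the input length is easy to predict. A small sketch (the formula assumes `center=True`, librosa's default):

```python
def n_mel_frames(n_samples: int, hop_length: int = 256) -> int:
    """Frame count produced by librosa.feature.melspectrogram with center=True."""
    return 1 + n_samples // hop_length

# One second of 24 kHz audio -> 94 frames; the model accepts any such length.
for seconds in (1, 3, 10):
    print(seconds, n_mel_frames(seconds * 24000))
```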
| Parameter | Value |
|---|---|
| Sample rate | 24000 Hz |
| FFT size | 1024 |
| Hop length | 256 |
| Mel bins | 128 |
| Frequency range | 0 - 12000 Hz |
| Mel scale | Slaney |
| Compression | log(clamp(x, min=1e-5)) |
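The compression row means silence and near-zero mel energies are floored at `log(1e-5) ≈ -11.51` rather than diverging to negative infinity. A quick numeric check of that step, written with NumPy as in the Python example above:

```python
import numpy as np

def compress(mel: np.ndarray) -> np.ndarray:
    """log(clamp(x, min=1e-5)) from the preprocessing table."""
    return np.log(np.clip(mel, a_min=1e-5, a_max=None))

x = np.array([0.0, 1e-5, 1.0])
print(compress(x))  # zeros are floored to log(1e-5) ~ -11.51; log(1.0) = 0.0
```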
Exported with `torch.onnx.export` (opset 18) from the standalone PyTorch model. Verified against the PyTorch output (cosine similarity > 0.9999). The export script is available at `scripts/export_onnx.py`.
```bibtex
@article{qwen3-tts,
  title={Qwen3-TTS Technical Report},
  author={Hu, Hangrui and Zhu, Xinfa and He, Ting and Guo, Dake and Zhang, Bin and Wang, Xiong and Guo, Zhifang and Jiang, Ziyue and Hao, Hongkun and Guo, Zishan and Zhang, Xinyu and Zhang, Pei and Yang, Baosong and Xu, Jin and Zhou, Jingren and Lin, Junyang},
  journal={arXiv preprint arXiv:2601.15621},
  year={2026}
}

@article{ecapa-tdnn,
  title={ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification},
  author={Desplanques, Brecht and Thienpondt, Jenthe and Demuynck, Kris},
  journal={Proc. Interspeech},
  year={2020}
}
```
Base model: `Qwen/Qwen3-TTS-12Hz-0.6B-Base`