# Qwen3-Embedding-0.6B-modelopt-fp8

An FP8 (E4M3) quantized version of Qwen/Qwen3-Embedding-0.6B, produced with NVIDIA ModelOpt static FP8 quantization.

## Model Details

| Property | Value |
|---|---|
| Base Model | Qwen/Qwen3-Embedding-0.6B |
| Architecture | Qwen3 (28 layers, 16 heads, 8 KV heads) |
| Hidden Size | 1024 |
| Intermediate Size | 3072 |
| Vocab Size | 151,669 |
| Max Position Embeddings | 32,768 |
| Quantization | FP8 E4M3 (weights + input activations) |
| Quantization Method | NVIDIA ModelOpt (`mtq.FP8_DEFAULT_CFG`) |
| Model Size | 717 MB (safetensors) |

## Quantization Details

### Method

- **Tool:** NVIDIA ModelOpt static FP8 quantization
- **Format:** FP8 E4M3 (`torch.float8_e4m3fn`)
- **Scope:** All linear layers (QKV projections, output projections, MLP layers) are quantized to FP8. Embeddings and LayerNorms remain in BF16.
- **Scales:** Per-tensor weight scales and input activation scales are stored alongside the quantized weights.
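The per-tensor scaling above can be sketched in a few lines. This is an illustrative simplification, not ModelOpt's implementation: it models the scale/clamp behavior of E4M3 (whose largest finite value is 448) but omits the 3-bit mantissa rounding a real FP8 cast performs.

```python
# Simplified sketch of per-tensor FP8 E4M3 static quantization.
# Illustrative only: models scaling and saturation, not mantissa rounding.

E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3

def per_tensor_scale(amax: float) -> float:
    """Map a tensor's observed absolute maximum onto the E4M3 range."""
    return amax / E4M3_MAX

def fake_quantize(x: float, scale: float) -> float:
    """Simulate an FP8 round-trip: scale down, clamp to range, scale back."""
    q = max(-E4M3_MAX, min(E4M3_MAX, x / scale))
    return q * scale

# A tensor whose amax is 448 gets scale 1.0; out-of-range values saturate.
scale = per_tensor_scale(448.0)   # -> 1.0
clamped = fake_quantize(500.0, scale)  # -> 448.0 (saturated)
```

At inference time the stored weight scale and input activation scale play exactly this role: the FP8 matmul runs on the scaled values and the output is rescaled back.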

### Calibration

- **Dataset:** CNN/DailyMail (real text data)
- **Samples:** 64
- **Sequence Length:** 256
- **Batch Size:** 4
- **Activation Scales:** Collected at 4 points per layer (post-layernorm, attention output, MLP input, SiLU output), saved in `calib.json`
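Static calibration amounts to running the 64 representative samples through the model and recording a running absolute maximum (amax) at each observation point, then deriving scales from those maxima. A minimal framework-free sketch follows; the `AmaxObserver` class, point names, and JSON layout are illustrative assumptions, not ModelOpt's code or the exact `calib.json` schema.

```python
import json

class AmaxObserver:
    """Track the running absolute maximum seen at one activation point."""
    def __init__(self) -> None:
        self.amax = 0.0

    def observe(self, values) -> None:
        batch_max = max(abs(v) for v in values)
        self.amax = max(self.amax, batch_max)

# One observer per collection point (the four points listed above).
observers = {
    "post_layernorm": AmaxObserver(),
    "attn_output": AmaxObserver(),
    "mlp_input": AmaxObserver(),
    "silu_output": AmaxObserver(),
}

# During calibration, each forward pass feeds its activations in:
observers["mlp_input"].observe([0.3, -2.5, 1.1])
observers["mlp_input"].observe([0.9, -0.4])

# After all batches, derive scales (amax / 448 for E4M3) and dump to JSON.
scales = {name: obs.amax / 448.0 for name, obs in observers.items()}
calib_blob = json.dumps(scales)
```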

## Precision Evaluation

Cosine similarity between embeddings from this FP8 model and the original BF16 model, measured on CNN/DailyMail text inputs (pass threshold: 0.99):

| Batch | Seq Len | Cosine Similarity | Result |
|---|---|---|---|
| 1 | 128 | 0.9936 | PASS |
| 2 | 512 | 0.9934 | PASS |
| 4 | 1024 | 0.9930 | PASS |
| 8 | 2048 | 0.9927 | PASS |
| 8 | 100 | 0.9937 | PASS |
| 8 | 500 | 0.9933 | PASS |
| 8 | 4000 | 0.9924 | PASS |

All configurations achieve >0.99 cosine similarity with the BF16 baseline.
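For reference, the metric in the table is the standard cosine similarity between the two models' embedding vectors for the same input; a minimal implementation:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Identical directions score 1.0; orthogonal vectors score 0.0.
cosine_similarity([1.0, 0.0], [1.0, 0.0])  # -> 1.0
```

A value above 0.99 means FP8 quantization barely rotates the embedding vectors, so downstream similarity rankings are essentially preserved.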

## File Structure

```
.
β”œβ”€β”€ config.json              # Model config with quantization_config
β”œβ”€β”€ model.safetensors        # FP8 quantized weights + scales
β”œβ”€β”€ calib.json               # Activation scales per layer
β”œβ”€β”€ tokenizer.json           # Tokenizer
β”œβ”€β”€ tokenizer_config.json    # Tokenizer config
β”œβ”€β”€ vocab.json               # Vocabulary
β”œβ”€β”€ merges.txt               # BPE merges
└── generation_config.json   # Generation config
```

## Intended Use

This model is intended for efficient FP8 inference of text embeddings on NVIDIA GPUs with hardware FP8 support (Hopper architecture and newer).
