# gemma-3-270m-modelopt-fp8

An FP8 (E4M3) version of google/gemma-3-270m, produced with NVIDIA ModelOpt static FP8 quantization.

## Model Details

| Property | Value |
|---|---|
| Base Model | google/gemma-3-270m |
| Architecture | Gemma3 (18 layers, 4 heads, 1 KV head) |
| Hidden Size | 640 |
| Intermediate Size | 2048 |
| Head Dim | 256 |
| Vocab Size | 262,144 |
| Max Position Embeddings | 32,768 |
| Attention | Sliding window (512) + full attention (every 6th layer) |
| Quantization | FP8 E4M3 (weights + input activations) |
| Quantization Method | NVIDIA ModelOpt (`mtq.FP8_DEFAULT_CFG`) |
| Model Size | 416 MB (safetensors) |

## Quantization Details

### Method

- **Tool:** NVIDIA ModelOpt static FP8 quantization
- **Format:** FP8 E4M3 (`torch.float8_e4m3fn`)
- **Scope:** All linear layers (QKV projections, output projections, MLP layers) are quantized to FP8. Embeddings and RMSNorms remain in BF16.
- **Scales:** Per-tensor weight scales and input-activation scales are stored alongside the quantized weights.
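The per-tensor scaling described above can be illustrated in plain Python. This is a hedged sketch of the common amax-based recipe, not ModelOpt's actual implementation: the scale maps a tensor's absolute maximum onto the FP8 E4M3 dynamic range (±448), and the round trip divides by the scale, clamps, and multiplies back (rounding onto the actual E4M3 grid is omitted for brevity). The helper names are illustrative.

```python
E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def per_tensor_scale(values):
    """Scale that maps the tensor's absolute max onto the E4M3 range."""
    amax = max(abs(v) for v in values)
    return amax / E4M3_MAX if amax > 0 else 1.0

def fake_quantize(values, scale):
    """Simulate an FP8 round trip: divide by scale, clamp to the E4M3
    range, then multiply back. (Rounding to the E4M3 grid is omitted.)"""
    out = []
    for v in values:
        q = max(-E4M3_MAX, min(E4M3_MAX, v / scale))
        out.append(q * scale)
    return out

weights = [0.5, -2.0, 0.875, 1.25]
scale = per_tensor_scale(weights)      # amax 2.0 mapped onto ±448
recovered = fake_quantize(weights, scale)
```

One scale per tensor (rather than per channel) is what keeps the stored metadata small: a single float accompanies each quantized weight and each input activation.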

### Calibration

- **Dataset:** CNN/DailyMail (real text data)
- **Samples:** 64
- **Sequence Length:** 256
- **Batch Size:** 4
- **Activation Scales:** Collected at 4 points per layer (post-layernorm, attention output, MLP input, GELU output) and saved in `calib.json`
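The four collection points per layer could be serialized along these lines. This is only a sketch of the structure described above; the key names and exact layout of this repo's `calib.json` are assumptions, not its verified schema.

```python
import json

# Hypothetical collection-point names; the real calib.json keys may differ.
POINTS = ["post_layernorm", "attn_output", "mlp_input", "gelu_output"]

def build_calib(layer_amaxes):
    """layer_amaxes: {layer_index: [amax observed at each of the 4 points]}.
    Returns a JSON-serializable dict with one entry per layer."""
    calib = {}
    for layer, amaxes in layer_amaxes.items():
        calib[f"layer_{layer}"] = dict(zip(POINTS, amaxes))
    return calib

# Example: scales for layer 0 collected during a calibration pass
calib = build_calib({0: [1.5, 0.8, 2.1, 0.6]})
blob = json.dumps(calib, indent=2)
```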

## Precision Evaluation

Cosine similarity between this FP8 model and the original BF16 model, measured on CNN/DailyMail text inputs (threshold: 0.99):

| Batch | Seq Len | Cosine Similarity | Result |
|---|---|---|---|
| 1 | 128 | 0.9919 | PASS |
| 2 | 512 | 0.9937 | PASS |
| 4 | 1024 | 0.9935 | PASS |
| 8 | 2048 | 0.9937 | PASS |
| 8 | 100 | 0.9920 | PASS |
| 8 | 500 | 0.9933 | PASS |
| 8 | 4000 | 0.9937 | PASS |

All configurations achieve >0.99 cosine similarity with the BF16 baseline.
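The metric in the table can be computed as the cosine similarity between the two models' flattened output logits. A minimal pure-Python sketch (in practice this would run on torch tensors from both models):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two flattened logit vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Illustrative values only: near-identical logits score close to 1.0,
# and a passing configuration must exceed the 0.99 threshold.
bf16_logits = [0.2, -1.3, 0.7, 3.1]
fp8_logits = [0.21, -1.29, 0.69, 3.08]
sim = cosine_similarity(bf16_logits, fp8_logits)
```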

## File Structure

```
.
├── config.json              # Model config with quantization_config
├── model.safetensors        # FP8 quantized weights + scales
├── calib.json               # Activation scales per layer
├── tokenizer.json           # Tokenizer
├── tokenizer_config.json    # Tokenizer config
├── special_tokens_map.json  # Special tokens
├── added_tokens.json        # Added tokens
└── generation_config.json   # Generation config
```

## Intended Use

This model is intended for efficient FP8 inference on NVIDIA GPUs with FP8 support (Hopper architecture and above).
