---
license: gemma
base_model: google/gemma-3-270m
library_name: transformers
tags:
- fp8
- quantized
- embedding
- nvidia-modelopt
pipeline_tag: feature-extraction
---

# gemma-3-270m-modelopt-fp8

FP8 (E4M3) quantized version of [google/gemma-3-270m](https://huggingface.co/google/gemma-3-270m), produced with [NVIDIA ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer) static FP8 quantization.

## Model Details

| Property | Value |
|----------|-------|
| Base Model | [google/gemma-3-270m](https://huggingface.co/google/gemma-3-270m) |
| Architecture | Gemma3 (18 layers, 4 attention heads, 1 KV head) |
| Hidden Size | 640 |
| Intermediate Size | 2048 |
| Head Dim | 256 |
| Vocab Size | 262,144 |
| Max Position Embeddings | 32,768 |
| Attention | Sliding window (512) + full attention (every 6th layer) |
| Quantization | FP8 E4M3 (weights + input activations) |
| Quantization Method | NVIDIA ModelOpt (`mtq.FP8_DEFAULT_CFG`) |
| Model Size | 416 MB (safetensors) |

## Quantization Details

### Method

- **Tool**: NVIDIA ModelOpt static FP8 quantization
- **Format**: FP8 E4M3 (`torch.float8_e4m3fn`)
- **Scope**: All linear layers (QKV projections, output projections, MLP layers) are quantized to FP8. Embeddings and RMSNorms remain in BF16.
- **Scales**: Per-tensor weight scales and input-activation scales are stored alongside the quantized weights.
### Calibration

- **Dataset**: [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail) (real text data)
- **Samples**: 64
- **Sequence Length**: 256
- **Batch Size**: 4
- **Activation Scales**: Collected at 4 points per layer (post-layernorm, attention output, MLP input, GELU output) and saved in `calib.json`

## Precision Evaluation

Cosine similarity between this FP8 model and the original BF16 model, measured on CNN/DailyMail text inputs (pass threshold: 0.99):

| Batch | Seq Len | Cosine Similarity | Result |
|-------|---------|-------------------|--------|
| 1 | 128 | 0.9919 | PASS |
| 2 | 512 | 0.9937 | PASS |
| 4 | 1024 | 0.9935 | PASS |
| 8 | 2048 | 0.9937 | PASS |
| 8 | 100 | 0.9920 | PASS |
| 8 | 500 | 0.9933 | PASS |
| 8 | 4000 | 0.9937 | PASS |

All configurations achieve >0.99 cosine similarity with the BF16 baseline.

## File Structure

```
.
├── config.json               # Model config with quantization_config
├── model.safetensors         # FP8 quantized weights + scales
├── calib.json                # Activation scales per layer
├── tokenizer.json            # Tokenizer
├── tokenizer_config.json     # Tokenizer config
├── special_tokens_map.json   # Special tokens
├── added_tokens.json         # Added tokens
└── generation_config.json    # Generation config
```

## Intended Use

This model is intended for efficient FP8 inference on NVIDIA GPUs with native FP8 support (Hopper architecture and newer).
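The cosine-similarity metric from the precision evaluation above can be sketched as follows. This is a self-contained illustration using random tensors as stand-ins for real model embeddings; the actual evaluation harness is not part of this repository:

```python
import torch
import torch.nn.functional as F

def mean_cosine_similarity(ref: torch.Tensor, quant: torch.Tensor) -> float:
    """Mean per-example cosine similarity between two embedding batches."""
    return F.cosine_similarity(ref, quant, dim=-1).mean().item()

# Stand-in data: baseline BF16 embeddings vs. a slightly perturbed copy
# (hidden size 640 matches gemma-3-270m; the noise level is illustrative).
torch.manual_seed(0)
ref = torch.randn(8, 640)
quant = ref + 0.01 * torch.randn(8, 640)

sim = mean_cosine_similarity(ref, quant)
print(f"cosine similarity: {sim:.4f}")
```

In the real evaluation, `ref` and `quant` would be last-hidden-state embeddings produced by the BF16 and FP8 models on the same CNN/DailyMail inputs, and a configuration passes when the mean similarity exceeds 0.99.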