---
license: gemma
base_model: google/gemma-3-270m
library_name: transformers
tags:
- fp8
- quantized
- embedding
- nvidia-modelopt
pipeline_tag: feature-extraction
---

# gemma-3-270m-modelopt-fp8

FP8 (E4M3) quantized version of [google/gemma-3-270m](https://huggingface.co/google/gemma-3-270m), produced with [NVIDIA ModelOpt](https://github.com/NVIDIA/TensorRT-Model-Optimizer) static FP8 quantization.

## Model Details

| Property | Value |
|----------|-------|
| Base Model | [google/gemma-3-270m](https://huggingface.co/google/gemma-3-270m) |
| Architecture | Gemma3 (18 layers, 4 attention heads, 1 KV head) |
| Hidden Size | 640 |
| Intermediate Size | 2048 |
| Head Dim | 256 |
| Vocab Size | 262,144 |
| Max Position Embeddings | 32,768 |
| Attention | Sliding window (512) + full attention (every 6th layer) |
| Quantization | FP8 E4M3 (weights + input activations) |
| Quantization Method | NVIDIA ModelOpt (`mtq.FP8_DEFAULT_CFG`) |
| Model Size | 416 MB (safetensors) |

## Quantization Details

### Method

- **Tool**: NVIDIA ModelOpt static FP8 quantization
- **Format**: FP8 E4M3 (`torch.float8_e4m3fn`)
- **Scope**: All linear layers (QKV projections, output projections, MLP layers) are quantized to FP8. Embeddings and RMSNorms remain in BF16.
- **Scales**: Per-tensor weight scales and input-activation scales are stored alongside the quantized weights.
### Calibration

- **Dataset**: [CNN/DailyMail](https://huggingface.co/datasets/cnn_dailymail) (real text data)
- **Samples**: 64
- **Sequence Length**: 256
- **Batch Size**: 4
- **Activation Scales**: Collected at 4 points per layer (post-layernorm, attention output, MLP input, GELU output) and saved in `calib.json`

## Precision Evaluation

Cosine similarity between this FP8 model and the original BF16 model, measured on CNN/DailyMail text inputs (pass threshold: 0.99):

| Batch | Seq Len | Cosine Similarity | Result |
|-------|---------|-------------------|--------|
| 1 | 128 | 0.9919 | PASS |
| 2 | 512 | 0.9937 | PASS |
| 4 | 1024 | 0.9935 | PASS |
| 8 | 2048 | 0.9937 | PASS |
| 8 | 100 | 0.9920 | PASS |
| 8 | 500 | 0.9933 | PASS |
| 8 | 4000 | 0.9937 | PASS |

All configurations achieve >0.99 cosine similarity with the BF16 baseline.

## File Structure

```
.
├── config.json               # Model config with quantization_config
├── model.safetensors         # FP8 quantized weights + scales
├── calib.json                # Activation scales per layer
├── tokenizer.json            # Tokenizer
├── tokenizer_config.json     # Tokenizer config
├── special_tokens_map.json   # Special tokens
├── added_tokens.json         # Added tokens
└── generation_config.json    # Generation config
```

## Intended Use

This model is intended for efficient FP8 inference on NVIDIA GPUs with native FP8 support (Hopper architecture and newer).
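The cosine-similarity metric from the precision evaluation above can be sketched as follows. This is a self-contained illustration using random tensors as stand-ins for real model embeddings; the actual evaluation harness is not part of this repository:

```python
import torch
import torch.nn.functional as F

def mean_cosine_similarity(ref: torch.Tensor, quant: torch.Tensor) -> float:
    """Mean per-example cosine similarity between two embedding batches."""
    return F.cosine_similarity(ref, quant, dim=-1).mean().item()

# Stand-in data: baseline BF16 embeddings vs. a slightly perturbed copy
# (hidden size 640 matches gemma-3-270m; the noise level is illustrative).
torch.manual_seed(0)
ref = torch.randn(8, 640)
quant = ref + 0.01 * torch.randn(8, 640)

sim = mean_cosine_similarity(ref, quant)
print(f"cosine similarity: {sim:.4f}")
```

In the real evaluation, `ref` and `quant` would be last-hidden-state embeddings produced by the BF16 and FP8 models on the same CNN/DailyMail inputs, and a configuration passes when the mean similarity exceeds 0.99.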