---
base_model: google/gemma-3-4b-it
tags:
- transformers
- torchao
- gemma3
license: apache-2.0
language:
- en
---

# HQQ-INT8-INT4 google/gemma-3-4b-it model

- **Developed by:** pytorch
- **License:** apache-2.0
- **Quantized from Model:** google/gemma-3-4b-it
- **Quantization Method:** HQQ-INT8-INT4
- **Terms of Use:** [Terms][terms]

[Gemma3-4B](https://huggingface.co/google/gemma-3-4b-it) is quantized by the PyTorch team using [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao), with 8-bit embeddings, and 8-bit dynamic activations with 4-bit weights for the linear layers (INT8-INT4).
The model is suitable for mobile deployment with [ExecuTorch](https://github.com/pytorch/executorch).

We provide the [quantized pte](https://huggingface.co/pytorch/gemma-3-4b-it-HQQ-INT8-INT4/blob/main/model.pte) for direct use in ExecuTorch.
(The provided pte file is exported with a max_seq_length/max_context_length of 1024; if you wish to change this, re-export the quantized model following the instructions in [Exporting to ExecuTorch](#exporting-to-executorch).)

# Running in a mobile app

To run in a mobile app, download the [quantized pte](https://huggingface.co/pytorch/gemma-3-4b-it-HQQ-INT8-INT4/blob/main/model.pte) and [tokenizer](https://huggingface.co/pytorch/gemma-3-4b-it-HQQ-INT8-INT4/blob/main/tokenizer.json) and follow the instructions [here](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple).
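
If you prefer to fetch the two files programmatically, here is a minimal sketch using `huggingface_hub` (assuming it is installed; this is not part of the official app instructions):

```Py
# A sketch: fetch the exported artifacts with huggingface_hub (assumed installed)
from huggingface_hub import hf_hub_download

repo_id = "pytorch/gemma-3-4b-it-HQQ-INT8-INT4"
pte_path = hf_hub_download(repo_id, "model.pte")             # exported ExecuTorch program
tokenizer_path = hf_hub_download(repo_id, "tokenizer.json")  # tokenizer used by the runner
print(pte_path, tokenizer_path)
```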

![](https://cdn-uploads.huggingface.co/production/uploads/6440ebeb9a1f17503f165971/Q_CE0K0vMkV6Dqv9Xf4RB.png)

# Quantization Recipe

First, install the required packages:
```Shell
pip install git+https://github.com/huggingface/transformers@main
pip install --pre torchao torch --index-url https://download.pytorch.org/whl/nightly/cu126
```
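
Optionally, confirm the nightly builds installed correctly before proceeding; a quick check might look like this:

```Py
# Optional sanity check: confirm the packages import and print their versions
import torch
import torchao
import transformers

print("torch:", torch.__version__)
print("torchao:", torchao.__version__)
print("transformers:", transformers.__version__)
```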

## Untie weights

```Py
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoProcessor,
    AutoTokenizer,
    TorchAoConfig,
)
from transformers.modeling_utils import find_tied_parameters

model_id = "google/gemma-3-4b-it"
MODEL_NAME = model_id.split("/")[-1]
save_to_local_path = f"{MODEL_NAME}-untied-weights"

untied_model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="cuda:0"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# Turn off weight tying in the text config so lm_head keeps its own copy of the weights
if getattr(
    untied_model.config.get_text_config(decoder=True), "tie_word_embeddings"
):
    setattr(
        untied_model.config.get_text_config(decoder=True),
        "tie_word_embeddings",
        False,
    )

untied_model._tied_weights_keys = []
untied_model.lm_head.weight = torch.nn.Parameter(
    untied_model.lm_head.weight.clone()
)

# Should now report no tied parameters
print("tied weights:", find_tied_parameters(untied_model))

# save the untied model, tokenizer, and processor locally
untied_model.save_pretrained(save_to_local_path)
tokenizer.save_pretrained(save_to_local_path)
processor.save_pretrained(save_to_local_path)
```
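
As an optional sanity check (a sketch, not part of the original recipe), you can reload the saved checkpoint and confirm that nothing is reported as tied:

```Py
# Optional: reload the untied checkpoint and verify no parameters are tied anymore
from transformers import AutoModelForCausalLM
from transformers.modeling_utils import find_tied_parameters

reloaded = AutoModelForCausalLM.from_pretrained(save_to_local_path, torch_dtype="auto")
print("tied weights after reload:", find_tied_parameters(reloaded))  # expect an empty result
```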

## Quantization

We used the following code to get the quantized model:

```Py
from torchao.quantization.quant_api import (
    IntxWeightOnlyConfig,
    Int8DynamicActivationIntxWeightConfig,
    ModuleFqnToConfig,
    quantize_,
)
from torchao.quantization.granularity import PerGroup, PerAxis
import torch

USER_ID = "YOUR_USER_ID"

# We start from the model with untied weights saved in the previous step
model_to_quantize = save_to_local_path

# Linear layers (default): int8 dynamic activations + int4 weights, group size 32, HQQ scale-only
int8_int4_config = Int8DynamicActivationIntxWeightConfig(
    weight_dtype=torch.int4,
    weight_granularity=PerGroup(32),
    intx_choose_qparams_algorithm="hqq_scale_only",
)
# int8 dynamic activations + int8 weights, per output channel
int8_int8_config = Int8DynamicActivationIntxWeightConfig(
    weight_dtype=torch.int8,
    weight_granularity=PerAxis(0),
    intx_choose_qparams_algorithm="hqq_scale_only",
)
# int8 weight-only, used for the embeddings
int8_weight_only_config = IntxWeightOnlyConfig(
    weight_dtype=torch.int8,
    granularity=PerAxis(0),
    intx_choose_qparams_algorithm="hqq_scale_only",
)

# Map module fully-qualified names to configs; the vision tower's mlp.fc2 layers get int8-int8
fqn_to_config = {}
fqn_to_config["_default"] = int8_int4_config
fqn_to_config["model.language_model.embed_tokens"] = int8_weight_only_config
fqn_to_config["model.vision_tower.vision_model.embeddings.position_embedding"] = int8_weight_only_config
for i in range(27):
    fqn_to_config[f"model.vision_tower.vision_model.encoder.layers.{i}.mlp.fc2"] = int8_int8_config
quant_config = ModuleFqnToConfig(fqn_to_config)
quantization_config = TorchAoConfig(quant_type=quant_config, include_input_output_embeddings=True, modules_to_not_convert=[])

quantized_model = AutoModelForCausalLM.from_pretrained(model_to_quantize, device_map="auto", torch_dtype=torch.bfloat16, quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(model_to_quantize)
processor = AutoProcessor.from_pretrained(model_to_quantize)

# Push to hub
save_to = f"{USER_ID}/{MODEL_NAME}-HQQ-INT8-INT4"
quantized_model.push_to_hub(save_to, safe_serialization=False)
tokenizer.push_to_hub(save_to)
processor.push_to_hub(save_to)

# Manual testing
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [
    {
        "role": "system",
        "content": "",
    },
    {"role": "user", "content": prompt},
]
templated_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print("Prompt:", prompt)
print("Templated prompt:", templated_prompt)
inputs = tokenizer(
    templated_prompt,
    return_tensors="pt",
).to("cuda")
generated_ids = quantized_model.generate(**inputs, max_new_tokens=128)
output_text = tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Response:", output_text[0][len(templated_prompt):])
```

The response from the manual testing is:

```
That's a really fascinating question! And a very common one when people interact with AI like me.

The short answer is: I can *simulate* conversation and respond to you in a way that *feels* like talking, but I'm not conscious in the way a human is.
```
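
After the push, the quantized checkpoint can be reloaded directly from the Hub in a fresh session; a short usage sketch (using the same `save_to` repo id as above) is:

```Py
# A sketch: load the pushed quantized checkpoint back from the Hub (torchao must be installed)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

reloaded_quantized = AutoModelForCausalLM.from_pretrained(
    save_to, device_map="auto", torch_dtype=torch.bfloat16
)
reloaded_tokenizer = AutoTokenizer.from_pretrained(save_to)
```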

# Model Quality

| Benchmark            | gemma-3-4b-it | pytorch/gemma-3-4b-it-HQQ-INT8-INT4 |
|----------------------|---------------|-------------------------------------|
| mmlu                 | 57.68         | 55.65                               |
| chartqa (multimodal) | 50.56         | 42.88                               |

<details>
<summary> Reproduce Model Quality Results </summary>

We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.

You will need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness#install

## baseline
```Shell
lm_eval --model hf --model_args pretrained=google/gemma-3-4b-it --tasks mmlu --device cuda:0 --batch_size auto
```

## int8 dynamic activation and int4 weight quantization using HQQ (HQQ-INT8-INT4)
```Shell
lm_eval --model hf --model_args pretrained=pytorch/gemma-3-4b-it-HQQ-INT8-INT4 --tasks mmlu --device cuda:0 --batch_size auto
```

## multi-modal eval
You will need to install lmms-eval from source:
`pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git`

```Shell
lmms-eval --model gemma3 --model_args "pretrained=google/gemma-3-4b-it,trust_remote_code=True,device_map=auto" --tasks chartqa --batch_size 1
```
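
The chartqa number for the quantized checkpoint in the table above can presumably be reproduced the same way by swapping in the quantized repo id (an untested sketch of the command):

```Shell
lmms-eval --model gemma3 --model_args "pretrained=pytorch/gemma-3-4b-it-HQQ-INT8-INT4,trust_remote_code=True,device_map=auto" --tasks chartqa --batch_size 1
```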

</details>

# Exporting to ExecuTorch
To export to ExecuTorch, we use [optimum-executorch](https://github.com/huggingface/optimum-executorch/tree/main).

We first install ExecuTorch and optimum-executorch:
```Shell
# Set up executorch
git clone https://github.com/pytorch/executorch.git
pushd executorch
git submodule update --init --recursive
python install_executorch.py
popd

# Install optimum-executorch
git clone https://github.com/huggingface/optimum-executorch.git
pushd optimum-executorch
python install_dev.py --skip_override_torch
popd
```

Now we can export our model to an ExecuTorch pte file and upload it to HuggingFace. The command below exports the model for the XNNPACK backend with a context length of 1024, but this can be adjusted.

```Shell
optimum-cli export executorch --model "pytorch/gemma-3-4b-it-HQQ-INT8-INT4" --task "multimodal-text-to-text" --recipe "xnnpack" --use_custom_sdpa --use_custom_kv_cache --max_seq_len 1024 --output_dir ./
hf upload pytorch/gemma-3-4b-it-HQQ-INT8-INT4 model.pte
```
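
As an optional check before uploading, a sketch like the following can confirm the artifact loads; it assumes ExecuTorch's Python runtime bindings (`executorch.runtime`) are available in your install, and the exact API may differ between versions:

```Py
# A sketch (assumes executorch.runtime is available; API may vary by version):
# load the exported program and list the methods it exposes.
from executorch.runtime import Runtime

runtime = Runtime.get()
program = runtime.load_program("model.pte")
print("methods in model.pte:", program.method_names)
```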

# Paper: TorchAO: PyTorch-Native Training-to-Serving Model Optimization
The model's quantization is powered by **TorchAO**, a framework presented in the paper [TorchAO: PyTorch-Native Training-to-Serving Model Optimization](https://huggingface.co/papers/2507.16099).

**Abstract:** We present TorchAO, a PyTorch-native model optimization framework leveraging quantization and sparsity to provide an end-to-end, training-to-serving workflow for AI models. TorchAO supports a variety of popular model optimization techniques, including FP8 quantized training, quantization-aware training (QAT), post-training quantization (PTQ), and 2:4 sparsity, and leverages a novel tensor subclass abstraction to represent a variety of widely-used, backend agnostic low precision data types, including INT4, INT8, FP8, MXFP4, MXFP6, and MXFP8. TorchAO integrates closely with the broader ecosystem at each step of the model optimization pipeline, from pre-training (TorchTitan) to fine-tuning (TorchTune, Axolotl) to serving (HuggingFace, vLLM, SGLang, ExecuTorch), connecting an otherwise fragmented space in a single, unified workflow. TorchAO has enabled recent launches of the quantized Llama 3.2 1B/3B and LlamaGuard3-8B models and is open-source at https://github.com/pytorch/ao.

# Resources
* **Official TorchAO GitHub Repository:** [https://github.com/pytorch/ao](https://github.com/pytorch/ao)
* **TorchAO Documentation:** [https://docs.pytorch.org/ao/stable/index.html](https://docs.pytorch.org/ao/stable/index.html)

# Disclaimer
PyTorch has not performed safety evaluations or red teamed the quantized models. Performance characteristics, outputs, and behaviors may differ from the original models. Users are solely responsible for selecting appropriate use cases, evaluating and mitigating for accuracy, safety, and fairness, ensuring security, and complying with all applicable laws and regulations.

Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the licenses the models are released under, including any limitations of liability or disclaimers of warranties provided therein.

[terms]: https://ai.google.dev/gemma/terms