---
base_model: google/gemma-3-4b-it
tags:
- transformers
- torchao
- gemma3
license: apache-2.0
language:
- en
---

# HQQ-INT8-INT4 google/gemma-3-4b-it model

- **Developed by:** pytorch
- **License:** apache-2.0
- **Quantized from Model:** google/gemma-3-4b-it
- **Quantization Method:** HQQ-INT8-INT4
- **Terms of Use:** [Terms][terms]

[Gemma3-4B](https://huggingface.co/google/gemma-3-4b-it) is quantized by the PyTorch team using [torchao](https://huggingface.co/docs/transformers/main/en/quantization/torchao), with 8-bit embeddings, and 8-bit dynamic activations with 4-bit weights for the linear layers (INT8-INT4).
The model is suitable for mobile deployment with [ExecuTorch](https://github.com/pytorch/executorch).

We provide the [quantized pte](https://huggingface.co/pytorch/gemma-3-4b-it-HQQ-INT8-INT4/blob/main/model.pte) for direct use in ExecuTorch.
(The provided pte file is exported with a max_seq_length/max_context_length of 1024; if you wish to change this, re-export the quantized model following the instructions in [Exporting to ExecuTorch](#exporting-to-executorch).)

# Running in a mobile app

To run in a mobile app, download the [quantized pte](https://huggingface.co/pytorch/gemma-3-4b-it-HQQ-INT8-INT4/blob/main/model.pte) and [tokenizer](https://huggingface.co/pytorch/gemma-3-4b-it-HQQ-INT8-INT4/blob/main/tokenizer.json) and follow the instructions [here](https://github.com/meta-pytorch/executorch-examples/tree/main/llm/apple).
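
If you prefer to fetch the two files programmatically, here is a minimal sketch using `huggingface_hub` (assuming it is installed; this is not part of the official app instructions):

```Py
# A sketch: fetch the exported artifacts with huggingface_hub (assumed installed)
from huggingface_hub import hf_hub_download

repo_id = "pytorch/gemma-3-4b-it-HQQ-INT8-INT4"
pte_path = hf_hub_download(repo_id, "model.pte")             # exported ExecuTorch program
tokenizer_path = hf_hub_download(repo_id, "tokenizer.json")  # tokenizer used by the runner
print(pte_path, tokenizer_path)
```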

![](https://cdn-uploads.huggingface.co/production/uploads/6440ebeb9a1f17503f165971/Q_CE0K0vMkV6Dqv9Xf4RB.png)

# Quantization Recipe

First, install the required packages:
```Shell
pip install git+https://github.com/huggingface/transformers@main
pip install --pre torchao torch --index-url https://download.pytorch.org/whl/nightly/cu126
```
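
Optionally, confirm the nightly builds installed correctly before proceeding; a quick check might look like this:

```Py
# Optional sanity check: confirm the packages import and print their versions
import torch
import torchao
import transformers

print("torch:", torch.__version__)
print("torchao:", torchao.__version__)
print("transformers:", transformers.__version__)
```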

## Untie weights

```Py
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoProcessor,
    AutoTokenizer,
    TorchAoConfig,
)
from transformers.modeling_utils import find_tied_parameters

model_id = "google/gemma-3-4b-it"
MODEL_NAME = model_id.split("/")[-1]
save_to_local_path = f"{MODEL_NAME}-untied-weights"

untied_model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="cuda:0"
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
processor = AutoProcessor.from_pretrained(model_id)

# Turn off weight tying in the text config so lm_head keeps its own copy of the weights
if getattr(
    untied_model.config.get_text_config(decoder=True), "tie_word_embeddings"
):
    setattr(
        untied_model.config.get_text_config(decoder=True),
        "tie_word_embeddings",
        False,
    )

untied_model._tied_weights_keys = []
untied_model.lm_head.weight = torch.nn.Parameter(
    untied_model.lm_head.weight.clone()
)

# Should now report no tied parameters
print("tied weights:", find_tied_parameters(untied_model))

# save the untied model, tokenizer, and processor locally
untied_model.save_pretrained(save_to_local_path)
tokenizer.save_pretrained(save_to_local_path)
processor.save_pretrained(save_to_local_path)
```
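
As an optional sanity check (a sketch, not part of the original recipe), you can reload the saved checkpoint and confirm that nothing is reported as tied:

```Py
# Optional: reload the untied checkpoint and verify no parameters are tied anymore
from transformers import AutoModelForCausalLM
from transformers.modeling_utils import find_tied_parameters

reloaded = AutoModelForCausalLM.from_pretrained(save_to_local_path, torch_dtype="auto")
print("tied weights after reload:", find_tied_parameters(reloaded))  # expect an empty result
```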

## Quantization

We used the following code to get the quantized model:

```Py
from torchao.quantization.quant_api import (
    IntxWeightOnlyConfig,
    Int8DynamicActivationIntxWeightConfig,
    ModuleFqnToConfig,
    quantize_,
)
from torchao.quantization.granularity import PerGroup, PerAxis
import torch

USER_ID = "YOUR_USER_ID"

# We start from the model with untied weights saved in the previous step
model_to_quantize = save_to_local_path

# Linear layers (default): int8 dynamic activations + int4 weights, group size 32, HQQ scale-only
int8_int4_config = Int8DynamicActivationIntxWeightConfig(
    weight_dtype=torch.int4,
    weight_granularity=PerGroup(32),
    intx_choose_qparams_algorithm="hqq_scale_only",
)
# int8 dynamic activations + int8 weights, per output channel
int8_int8_config = Int8DynamicActivationIntxWeightConfig(
    weight_dtype=torch.int8,
    weight_granularity=PerAxis(0),
    intx_choose_qparams_algorithm="hqq_scale_only",
)
# int8 weight-only, used for the embeddings
int8_weight_only_config = IntxWeightOnlyConfig(
    weight_dtype=torch.int8,
    granularity=PerAxis(0),
    intx_choose_qparams_algorithm="hqq_scale_only",
)

# Map module fully-qualified names to configs; the vision tower's mlp.fc2 layers get int8-int8
fqn_to_config = {}
fqn_to_config["_default"] = int8_int4_config
fqn_to_config["model.language_model.embed_tokens"] = int8_weight_only_config
fqn_to_config["model.vision_tower.vision_model.embeddings.position_embedding"] = int8_weight_only_config
for i in range(27):
    fqn_to_config[f"model.vision_tower.vision_model.encoder.layers.{i}.mlp.fc2"] = int8_int8_config
quant_config = ModuleFqnToConfig(fqn_to_config)
quantization_config = TorchAoConfig(quant_type=quant_config, include_input_output_embeddings=True, modules_to_not_convert=[])

quantized_model = AutoModelForCausalLM.from_pretrained(model_to_quantize, device_map="auto", torch_dtype=torch.bfloat16, quantization_config=quantization_config)
tokenizer = AutoTokenizer.from_pretrained(model_to_quantize)
processor = AutoProcessor.from_pretrained(model_to_quantize)

# Push to hub
save_to = f"{USER_ID}/{MODEL_NAME}-HQQ-INT8-INT4"
quantized_model.push_to_hub(save_to, safe_serialization=False)
tokenizer.push_to_hub(save_to)
processor.push_to_hub(save_to)

# Manual testing
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [
    {
        "role": "system",
        "content": "",
    },
    {"role": "user", "content": prompt},
]
templated_prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
print("Prompt:", prompt)
print("Templated prompt:", templated_prompt)
inputs = tokenizer(
    templated_prompt,
    return_tensors="pt",
).to("cuda")
generated_ids = quantized_model.generate(**inputs, max_new_tokens=128)
output_text = tokenizer.batch_decode(
    generated_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print("Response:", output_text[0][len(templated_prompt):])
```

The response from the manual testing is:

```
That's a really fascinating question! And a very common one when people interact with AI like me.

The short answer is: I can *simulate* conversation and respond to you in a way that *feels* like talking, but I'm not conscious in the way a human is.
```
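
After the push, the quantized checkpoint can be reloaded directly from the Hub in a fresh session; a short usage sketch (using the same `save_to` repo id as above) is:

```Py
# A sketch: load the pushed quantized checkpoint back from the Hub (torchao must be installed)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

reloaded_quantized = AutoModelForCausalLM.from_pretrained(
    save_to, device_map="auto", torch_dtype=torch.bfloat16
)
reloaded_tokenizer = AutoTokenizer.from_pretrained(save_to)
```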

# Model Quality

| Benchmark            | gemma-3-4b-it | pytorch/gemma-3-4b-it-HQQ-INT8-INT4 |
|----------------------|---------------|-------------------------------------|
| mmlu                 | 57.68         | 55.65                               |
| chartqa (multimodal) | 50.56         | 42.88                               |

<details>
<summary> Reproduce Model Quality Results </summary>

We rely on [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness) to evaluate the quality of the quantized model.

You will need to install lm-eval from source: https://github.com/EleutherAI/lm-evaluation-harness#install

## baseline
```Shell
lm_eval --model hf --model_args pretrained=google/gemma-3-4b-it --tasks mmlu --device cuda:0 --batch_size auto
```

## int8 dynamic activation and int4 weight quantization using HQQ (HQQ-INT8-INT4)
```Shell
lm_eval --model hf --model_args pretrained=pytorch/gemma-3-4b-it-HQQ-INT8-INT4 --tasks mmlu --device cuda:0 --batch_size auto
```

## multi-modal eval
You will need to install lmms-eval from source:
`pip install git+https://github.com/EvolvingLMMs-Lab/lmms-eval.git`

```Shell
lmms-eval --model gemma3 --model_args "pretrained=google/gemma-3-4b-it,trust_remote_code=True,device_map=auto" --tasks chartqa --batch_size 1
```
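
The chartqa number for the quantized checkpoint in the table above can presumably be reproduced the same way by swapping in the quantized repo id (an untested sketch of the command):

```Shell
lmms-eval --model gemma3 --model_args "pretrained=pytorch/gemma-3-4b-it-HQQ-INT8-INT4,trust_remote_code=True,device_map=auto" --tasks chartqa --batch_size 1
```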

</details>

# Exporting to ExecuTorch
To export to ExecuTorch, we use [optimum-executorch](https://github.com/huggingface/optimum-executorch/tree/main).

We first install ExecuTorch and optimum-executorch:
```Shell
# Set up executorch
git clone https://github.com/pytorch/executorch.git
pushd executorch
git submodule update --init --recursive
python install_executorch.py
popd

# Install optimum-executorch
git clone https://github.com/huggingface/optimum-executorch.git
pushd optimum-executorch
python install_dev.py --skip_override_torch
popd
```

Now we can export our model to an ExecuTorch pte file and upload it to HuggingFace. The command below exports the model for the XNNPACK backend with a context length of 1024, but this can be adjusted.

```Shell
optimum-cli export executorch --model "pytorch/gemma-3-4b-it-HQQ-INT8-INT4" --task "multimodal-text-to-text" --recipe "xnnpack" --use_custom_sdpa --use_custom_kv_cache --max_seq_len 1024 --output_dir ./
hf upload pytorch/gemma-3-4b-it-HQQ-INT8-INT4 model.pte
```
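
As an optional check before uploading, a sketch like the following can confirm the artifact loads; it assumes ExecuTorch's Python runtime bindings (`executorch.runtime`) are available in your install, and the exact API may differ between versions:

```Py
# A sketch (assumes executorch.runtime is available; API may vary by version):
# load the exported program and list the methods it exposes.
from executorch.runtime import Runtime

runtime = Runtime.get()
program = runtime.load_program("model.pte")
print("methods in model.pte:", program.method_names)
```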

# Paper: TorchAO: PyTorch-Native Training-to-Serving Model Optimization
The model's quantization is powered by **TorchAO**, a framework presented in the paper [TorchAO: PyTorch-Native Training-to-Serving Model Optimization](https://huggingface.co/papers/2507.16099).

**Abstract:** We present TorchAO, a PyTorch-native model optimization framework leveraging quantization and sparsity to provide an end-to-end, training-to-serving workflow for AI models. TorchAO supports a variety of popular model optimization techniques, including FP8 quantized training, quantization-aware training (QAT), post-training quantization (PTQ), and 2:4 sparsity, and leverages a novel tensor subclass abstraction to represent a variety of widely-used, backend agnostic low precision data types, including INT4, INT8, FP8, MXFP4, MXFP6, and MXFP8. TorchAO integrates closely with the broader ecosystem at each step of the model optimization pipeline, from pre-training (TorchTitan) to fine-tuning (TorchTune, Axolotl) to serving (HuggingFace, vLLM, SGLang, ExecuTorch), connecting an otherwise fragmented space in a single, unified workflow. TorchAO has enabled recent launches of the quantized Llama 3.2 1B/3B and LlamaGuard3-8B models and is open-source at https://github.com/pytorch/ao.

# Resources
* **Official TorchAO GitHub Repository:** [https://github.com/pytorch/ao](https://github.com/pytorch/ao)
* **TorchAO Documentation:** [https://docs.pytorch.org/ao/stable/index.html](https://docs.pytorch.org/ao/stable/index.html)

# Disclaimer
PyTorch has not performed safety evaluations or red teamed the quantized models. Performance characteristics, outputs, and behaviors may differ from the original models. Users are solely responsible for selecting appropriate use cases, evaluating and mitigating for accuracy, safety, and fairness, ensuring security, and complying with all applicable laws and regulations.

Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the licenses the models are released under, including any limitations of liability or disclaimers of warranties provided therein.

[terms]: https://ai.google.dev/gemma/terms