Instructions to use QuantTrio/Qwen3.5-2B-AWQ with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use QuantTrio/Qwen3.5-2B-AWQ with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="QuantTrio/Qwen3.5-2B-AWQ")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("QuantTrio/Qwen3.5-2B-AWQ")
model = AutoModelForImageTextToText.from_pretrained("QuantTrio/Qwen3.5-2B-AWQ")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use QuantTrio/Qwen3.5-2B-AWQ with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "QuantTrio/Qwen3.5-2B-AWQ"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuantTrio/Qwen3.5-2B-AWQ",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
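
The curl request above can also be sent from Python. What follows is a minimal sketch using only the standard library, assuming the vLLM server started above is listening on localhost:8000. The helper names `build_chat_payload` and `post_chat` are hypothetical and not part of vLLM; the request body mirrors the curl example, and the reply is read from the standard OpenAI-style `choices[0].message.content` field.

```python
import json
import urllib.request


def build_chat_payload(model, prompt, image_url):
    """Build an OpenAI-style chat completion body with one text part and one image part."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def post_chat(base_url, payload, timeout=120):
    """POST the payload to <base_url>/v1/chat/completions and return the reply text."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Same model, prompt, and image URL as the curl example.
payload = build_chat_payload(
    "QuantTrio/Qwen3.5-2B-AWQ",
    "Describe this image in one sentence.",
    "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg",
)
# With the server running (assumed address):
# print(post_chat("http://localhost:8000", payload))
```

The network call is left commented out so the snippet can be run without a server. The same payload should work against any OpenAI-compatible endpoint by changing `base_url`.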

Use Docker

docker model run hf.co/QuantTrio/Qwen3.5-2B-AWQ

SGLang

How to use QuantTrio/Qwen3.5-2B-AWQ with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "QuantTrio/Qwen3.5-2B-AWQ" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuantTrio/Qwen3.5-2B-AWQ",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "QuantTrio/Qwen3.5-2B-AWQ" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "QuantTrio/Qwen3.5-2B-AWQ",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner

How to use QuantTrio/Qwen3.5-2B-AWQ with Docker Model Runner:
```
docker model run hf.co/QuantTrio/Qwen3.5-2B-AWQ
```

TensorRT-LLM 1.3.0rc8 fails to load Qwen3.5-2B-AWQ with unsupported AWQ quantization_config

by ernestyalumni - opened Mar 23

This is more of an FYI (especially for the folks maintaining TensorRT-LLM) than a report of a problem with the model itself:

Trying to serve this model with trtllm-serve on TensorRT-LLM 1.3.0rc8 fails during model load because the checkpoint’s Hugging Face AWQ quantization_config is rejected as unsupported. The error ends with NotImplementedError: Unsupported quantization_config: {'quant_method': 'awq', 'bits': 4, ...}. Here is the full error output:

root@Zephyrus-G15-GA503QR:/Data/Models/LLM/QuantTrio/Qwen3.5-2B-AWQ# trtllm-serve . --port 30000
/usr/local/lib/python3.12/dist-packages/torch/library.py:356: UserWarning: Warning only once for all operators,  other operators may also be overridden.
  Overriding a previously registered kernel for the same operator and the same dispatch key
  operator: flash_attn::_flash_attn_backward(Tensor dout, Tensor q, Tensor k, Tensor v, Tensor out, Tensor softmax_lse, Tensor(a6!)? dq, Tensor(a7!)? dk, Tensor(a8!)? dv, float dropout_p, float softmax_scale, bool causal, SymInt window_size_left, SymInt window_size_right, float softcap, Tensor? alibi_slopes, bool deterministic, Tensor? rng_state=None) -> Tensor
    registered at /usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py:922
  dispatch key: ADInplaceOrView
  previous kernel: no debug info
       new kernel: registered at /usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py:922 (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/core/dispatch/OperatorEntry.cpp:208.)
  self.m.impl(
/usr/local/lib/python3.12/dist-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.57.6 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
  _warnings.warn(
[TensorRT-LLM] TensorRT LLM version: 1.3.0rc8
/usr/local/lib/python3.12/dist-packages/tensorrt_llm/serve/openai_protocol.py:108: UserWarning: Field name "schema" in "ResponseFormat" shadows an attribute in parent "OpenAIBaseModel"
  class ResponseFormat(OpenAIBaseModel):
[03/23/2026-03:30:23] [TRT-LLM] [I] Using LLM with PyTorch backend
[03/23/2026-03:30:23] [TRT-LLM] [I] neither checkpoint_format nor checkpoint_loader were provided, checkpoint_format will be set to HF.
[03/23/2026-03:30:23] [TRT-LLM] [I] Found quantization_config field in ./config.json, pre-quantized checkpoint is used.
[03/23/2026-03:30:23] [TRT-LLM] [I] Use quantization_config from ./config.json: quantization_config={'quant_method': 'awq', 'bits': 4, 'group_size': 128, 'version': 'gemm', 'zero_point': True, 'modules_to_not_convert': ['visual', 'linear_attn', 'self_attn', 'model.layers.0.', 'mtp']}
Traceback (most recent call last):
  File "/usr/local/bin/trtllm-serve", line 7, in <module>
    sys.exit(main())
             ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1485, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1406, in main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1873, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1269, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 824, in invoke
    return callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/serve.py", line 910, in serve
    _serve_llm()
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/serve.py", line 882, in _serve_llm
    launch_server(host,
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/serve.py", line 280, in launch_server
    llm = PyTorchLLM(**llm_args)
          ^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 1289, in __init__
    super().__init__(model, tokenizer, tokenizer_mode, skip_tokenizer_init,
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 1171, in __init__
    super().__init__(model,
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 251, in __init__
    self._build_model()
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 1216, in _build_model
    super()._build_model()
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 851, in _build_model
    self._engine_dir, self._hf_model_dir = model_loader()
                                           ^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm_utils.py", line 758, in __call__
    self.model_loader._update_from_hf_quant_config()
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm_utils.py", line 493, in _update_from_hf_quant_config
    raise NotImplementedError(
NotImplementedError: Unsupported quantization_config: {'quant_method': 'awq', 'bits': 4, 'group_size': 128, 'version': 'gemm', 'zero_point': True, 'modules_to_not_convert': ['visual', 'linear_attn', 'self_attn', 'model.layers.0.', 'mtp']}.
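
Since the failure hinges on the quantization_config that TensorRT-LLM reads from config.json, it can help to inspect that field before trying to serve a checkpoint. Below is a minimal Python sketch, assuming the field sits at the top level of config.json as the log above indicates. The function names `read_quant_config` and `summarize_quant_config` are hypothetical, and the sketch only reports what the checkpoint declares; it makes no claim about which quantization methods a given TensorRT-LLM release accepts.

```python
import json
from pathlib import Path


def read_quant_config(model_dir):
    """Return the quantization_config dict from <model_dir>/config.json, or None if absent."""
    config_path = Path(model_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return config.get("quantization_config")


def summarize_quant_config(quant_config):
    """Return a one-line summary of the declared quantization settings."""
    if quant_config is None:
        return "no quantization_config found"
    method = quant_config.get("quant_method", "unknown")
    bits = quant_config.get("bits", "?")
    group_size = quant_config.get("group_size", "?")
    return f"quant_method={method}, bits={bits}, group_size={group_size}"


# Values copied from the log output above.
SAMPLE_QUANT_CONFIG = {
    "quant_method": "awq",
    "bits": 4,
    "group_size": 128,
    "version": "gemm",
    "zero_point": True,
    "modules_to_not_convert": ["visual", "linear_attn", "self_attn", "model.layers.0.", "mtp"],
}

summary = summarize_quant_config(SAMPLE_QUANT_CONFIG)
print(summary)
```

Against a local copy of the checkpoint, `summarize_quant_config(read_quant_config("."))` run from the model directory would report the same fields that appear in the error message.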
