Instructions to use QuantTrio/Qwen3.5-2B-AWQ with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- Transformers
How to use QuantTrio/Qwen3.5-2B-AWQ with Transformers:
# Use a pipeline as a high-level helper from transformers import pipeline pipe = pipeline("image-text-to-text", model="QuantTrio/Qwen3.5-2B-AWQ") messages = [ { "role": "user", "content": [ {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"}, {"type": "text", "text": "What animal is on the candy?"} ] }, ] pipe(text=messages)# Load model directly from transformers import AutoProcessor, AutoModelForImageTextToText processor = AutoProcessor.from_pretrained("QuantTrio/Qwen3.5-2B-AWQ") model = AutoModelForImageTextToText.from_pretrained("QuantTrio/Qwen3.5-2B-AWQ") messages = [ { "role": "user", "content": [ {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"}, {"type": "text", "text": "What animal is on the candy?"} ] }, ] inputs = processor.apply_chat_template( messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt", ).to(model.device) outputs = model.generate(**inputs, max_new_tokens=40) print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:])) - Notebooks
- Google Colab
- Kaggle
- Local Apps
- vLLM
How to use QuantTrio/Qwen3.5-2B-AWQ with vLLM:
Install from pip and serve model
# Install vLLM from pip: pip install vllm # Start the vLLM server: vllm serve "QuantTrio/Qwen3.5-2B-AWQ" # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:8000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "QuantTrio/Qwen3.5-2B-AWQ", "messages": [ { "role": "user", "content": [ { "type": "text", "text": "Describe this image in one sentence." }, { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } } ] } ] }'Use Docker
docker model run hf.co/QuantTrio/Qwen3.5-2B-AWQ
- SGLang
How to use QuantTrio/Qwen3.5-2B-AWQ with SGLang:
Install from pip and serve model
# Install SGLang from pip: pip install sglang # Start the SGLang server: python3 -m sglang.launch_server \ --model-path "QuantTrio/Qwen3.5-2B-AWQ" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "QuantTrio/Qwen3.5-2B-AWQ", "messages": [ { "role": "user", "content": [ { "type": "text", "text": "Describe this image in one sentence." }, { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } } ] } ] }'Use Docker images
docker run --gpus all \ --shm-size 32g \ -p 30000:30000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ --env "HF_TOKEN=<secret>" \ --ipc=host \ lmsysorg/sglang:latest \ python3 -m sglang.launch_server \ --model-path "QuantTrio/Qwen3.5-2B-AWQ" \ --host 0.0.0.0 \ --port 30000 # Call the server using curl (OpenAI-compatible API): curl -X POST "http://localhost:30000/v1/chat/completions" \ -H "Content-Type: application/json" \ --data '{ "model": "QuantTrio/Qwen3.5-2B-AWQ", "messages": [ { "role": "user", "content": [ { "type": "text", "text": "Describe this image in one sentence." }, { "type": "image_url", "image_url": { "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg" } } ] } ] }' - Docker Model Runner
How to use QuantTrio/Qwen3.5-2B-AWQ with Docker Model Runner:
docker model run hf.co/QuantTrio/Qwen3.5-2B-AWQ
TensorRT-LLM 1.3.0rc8 fails to load Qwen3.5-2B-AWQ with unsupported AWQ quantization_config
#1
by ernestyalumni - opened
This is more informative (especially to the folks maintaining TensorRT-LLM) than a problem with the model itself:
Trying to serve this model with trtllm-serve on TensorRT-LLM 1.3.0rc8 fails during model load because the checkpoint’s Hugging Face
AWQ quantization_config is rejected as unsupported. The error ends with NotImplementedError: Unsupported quantization_config:
{'quant_method': 'awq', 'bits': 4, ...}. Here is full error output:
root@Zephyrus-G15-GA503QR:/Data/Models/LLM/QuantTrio/Qwen3.5-2B-AWQ# trtllm-serve . --port 30000
/usr/local/lib/python3.12/dist-packages/torch/library.py:356: UserWarning: Warning only once for all operators, other operators may also be overridden.
Overriding a previously registered kernel for the same operator and the same dispatch key
operator: flash_attn::_flash_attn_backward(Tensor dout, Tensor q, Tensor k, Tensor v, Tensor out, Tensor softmax_lse, Tensor(a6!)? dq, Tensor(a7!)? dk, Tensor(a8!)? dv, float dropout_p, float softmax_scale, bool causal, SymInt window_size_left, SymInt window_size_right, float softcap, Tensor? alibi_slopes, bool deterministic, Tensor? rng_state=None) -> Tensor
registered at /usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py:922
dispatch key: ADInplaceOrView
previous kernel: no debug info
new kernel: registered at /usr/local/lib/python3.12/dist-packages/torch/_library/custom_ops.py:922 (Triggered internally at /opt/pytorch/pytorch/aten/src/ATen/core/dispatch/OperatorEntry.cpp:208.)
self.m.impl(
/usr/local/lib/python3.12/dist-packages/modelopt/torch/__init__.py:36: UserWarning: transformers version 4.57.6 is incompatible with nvidia-modelopt and may cause issues. Please install recommended version with `pip install nvidia-modelopt[hf]` if working with HF models.
_warnings.warn(
[TensorRT-LLM] TensorRT LLM version: 1.3.0rc8
/usr/local/lib/python3.12/dist-packages/tensorrt_llm/serve/openai_protocol.py:108: UserWarning: Field name "schema" in "ResponseFormat" shadows an attribute in parent "OpenAIBaseModel"
class ResponseFormat(OpenAIBaseModel):
[03/23/2026-03:30:23] [TRT-LLM] [I] Using LLM with PyTorch backend
[03/23/2026-03:30:23] [TRT-LLM] [I] neither checkpoint_format nor checkpoint_loader were provided, checkpoint_format will be set to HF.
[03/23/2026-03:30:23] [TRT-LLM] [I] Found quantization_config field in ./config.json, pre-quantized checkpoint is used.
[03/23/2026-03:30:23] [TRT-LLM] [I] Use quantization_config from ./config.json: quantization_config={'quant_method': 'awq', 'bits': 4, 'group_size': 128, 'version': 'gemm', 'zero_point': True, 'modules_to_not_convert': ['visual', 'linear_attn', 'self_attn', 'model.layers.0.', 'mtp']}
Traceback (most recent call last):
File "/usr/local/bin/trtllm-serve", line 7, in <module>
sys.exit(main())
^^^^^^
File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1485, in __call__
return self.main(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1406, in main
rv = self.invoke(ctx)
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1873, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1269, in invoke
return ctx.invoke(self.callback, **ctx.params)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 824, in invoke
return callback(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/serve.py", line 910, in serve
_serve_llm()
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/serve.py", line 882, in _serve_llm
launch_server(host,
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/commands/serve.py", line 280, in launch_server
llm = PyTorchLLM(**llm_args)
^^^^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 1289, in __init__
super().__init__(model, tokenizer, tokenizer_mode, skip_tokenizer_init,
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 1171, in __init__
super().__init__(model,
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 251, in __init__
self._build_model()
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 1216, in _build_model
super()._build_model()
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm.py", line 851, in _build_model
self._engine_dir, self._hf_model_dir = model_loader()
^^^^^^^^^^^^^^
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm_utils.py", line 758, in __call__
self.model_loader._update_from_hf_quant_config()
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/llm_utils.py", line 493, in _update_from_hf_quant_config
raise NotImplementedError(
NotImplementedError: Unsupported quantization_config: {'quant_method': 'awq', 'bits': 4, 'group_size': 128, 'version': 'gemm', 'zero_point': True, 'modules_to_not_convert': ['visual', 'linear_attn', 'self_attn', 'model.layers.0.', 'mtp']}.