z-lab/Qwen3.5-9B-PARO

Pairwise Rotation Quantization for Efficient Reasoning LLM Inference

Paper Blog Models PyPI

ParoQuant is the state-of-the-art INT4 quantization for LLMs. It closes the accuracy gap with FP16 while running at near-AWQ speed. Supports NVIDIA GPUs (vLLM, Transformers) and Apple Silicon (MLX). For more information, see https://github.com/z-lab/paroquant.

z-lab/Qwen3.5-9B-PARO is a 4-bit Qwen/Qwen3.5-9B quantized with ParoQuant. Check out other ParoQuant models from the Hugging Face collection.

Quick Start

Installation

# NVIDIA GPU (CUDA 12.9)
pip install "paroquant[vllm]"

# NVIDIA GPU (CUDA 13.0)
pip install "paroquant[vllm]" "vllm==0.17.1" \
  --extra-index-url https://wheels.vllm.ai/0.17.1/cu130 \
  --extra-index-url https://download.pytorch.org/whl/cu130

# Apple Silicon
pip install "paroquant[mlx]"

Interactive Chat

python -m paroquant.cli.chat --model z-lab/Qwen3.5-9B-PARO

OpenAI-Compatible API Server

python -m paroquant.cli.serve --model z-lab/Qwen3.5-9B-PARO --port 8000

For vLLM, the arguments are passed to the vLLM server directly. See vLLM docs for more details.

For MLX, add --vlm if you wish to load the VLM components and use the model's multimodal features. For vLLM, VLM components are loaded by default and can be skipped with the server argument --language-model-only.

Docker (NVIDIA GPU)

The following commands map the local cache directory to the container in order to persist kernel cache across runs. Remove -v ... to disable this behaviour.

# Interactive chat
docker run --pull=always --rm -it --gpus all --ipc=host \
  -v $HOME/.cache/paroquant:/root/.cache/paroquant \
  ghcr.io/z-lab/paroquant:chat --model z-lab/Qwen3.5-9B-PARO

# API server (port 8000)
docker run --pull=always --rm -it --gpus all --ipc=host -p 8000:8000 \
  -v $HOME/.cache/paroquant:/root/.cache/paroquant \
  ghcr.io/z-lab/paroquant:serve --model z-lab/Qwen3.5-9B-PARO

Citation

@inproceedings{liang2026paroquant,
  title     = {{ParoQuant: Pairwise Rotation Quantization for Efficient Reasoning LLM Inference}},
  author    = {Liang, Yesheng and Chen, Haisheng and Zhang, Zihan and Han, Song and Liu, Zhijian},
  booktitle = {International Conference on Learning Representations (ICLR)},
  year      = {2026}
}
Downloads last month
50,138
Safetensors
Model size
3B params
Tensor type
I32
·
F16
·
I16
·
MLX
Hardware compatibility
Log In to add your hardware

4-bit

Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for z-lab/Qwen3.5-9B-PARO

Finetuned
Qwen/Qwen3.5-9B
Quantized
(126)
this model

Collection including z-lab/Qwen3.5-9B-PARO

Paper for z-lab/Qwen3.5-9B-PARO