HLC-compressed-models

Compressed LLM checkpoints from "Hierarchical Low-Rank Compression for LLMs" (NeurIPS 2026 submission).

Available Models

Model	Compression ratio	Stage B (SVD)	Stage F (fine-tuned)
LLaMA-7B	20%-80%	`llama7b/r{02,04,06,08}/B`	`llama7b/r{02,04,06,08}/F`
Qwen3-14B	20%	`qwen3_14b/r02/B`	`qwen3_14b/r02/F`

Quick Start — Merged Format

The merged checkpoints load with the standard Hugging Face from_pretrained() call and are the same size as the original model:

from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "zhc12/HLC-compressed-models"

# Qwen3-14B at 20% compression (r02, fine-tuned)
model = AutoModelForCausalLM.from_pretrained(repo, subfolder="qwen3_14b/r02/F")
tokenizer = AutoTokenizer.from_pretrained(repo, subfolder="qwen3_14b/r02/F")

Factored Format — Low-Rank A, B Matrices

Each F/ directory also contains factors.pt, which holds the low-rank factors A (d × r) and B (r × n) for every compressed linear layer. At the 20% ratio (r02) the factors are about 20% smaller than the merged weights, and they preserve the compression structure.

Loading factors (for analysis or continued training)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import hf_hub_download

repo = "zhc12/HLC-compressed-models"
subfolder = "qwen3_14b/r02/F"

# Load the base merged model
model = AutoModelForCausalLM.from_pretrained(repo, subfolder=subfolder)
tokenizer = AutoTokenizer.from_pretrained(repo, subfolder=subfolder)

# Download and load factors
factors_path = hf_hub_download(repo, f"{subfolder}/factors.pt")
factors = torch.load(factors_path, map_location="cpu", weights_only=True)

# factors is a dict: {(layer_idx, sublayer_name): {"A": tensor, "B": tensor}}
# Example: factors[(0, "self_attn.q_proj")]["A"].shape = (5120, 2048)
print(f"Loaded {len(factors)} factor pairs")
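
Before swapping layers it can help to inspect the ranks and parameter counts stored in the factor dictionary. The sketch below assumes only the layout described above (tuple keys, "A" and "B" tensors); `summarize_factors` is a hypothetical helper, the toy shapes are made up, and the "mlp.down_proj" key is an assumed name following the same pattern as "self_attn.q_proj".

```python
import torch

def summarize_factors(factors):
    """Map each key to (rank, factored parameter count, dense parameter count)."""
    summary = {}
    for key, pair in factors.items():
        A, B = pair["A"], pair["B"]
        d, r = A.shape
        r_b, n = B.shape
        # A is (d, r) and B is (r, n), so the inner dimensions must agree
        assert r == r_b, f"rank mismatch for {key}"
        summary[key] = (r, A.numel() + B.numel(), d * n)
    return summary

# Synthetic stand-in with the same layout as factors.pt (tiny, hypothetical shapes)
toy_factors = {
    (0, "self_attn.q_proj"): {"A": torch.zeros(8, 3), "B": torch.zeros(3, 8)},
    (0, "mlp.down_proj"): {"A": torch.zeros(8, 4), "B": torch.zeros(4, 16)},
}

summary = summarize_factors(toy_factors)
for (layer_idx, name), (rank, factored, dense) in summary.items():
    print(f"layer {layer_idx} {name}: rank={rank}, {factored}/{dense} params")
```

Passing the real `factors` dict loaded above in place of `toy_factors` should give the per-sublayer ranks of the checkpoint.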

Restoring low-rank structure (no external dependencies)

import torch
import torch.nn as nn

class CompressedLinear(nn.Module):
    """Low-rank linear: y = A @ (B @ x) + bias, where A is (d, r) and B is (r, n)."""
    def __init__(self, A, B, bias=None):
        super().__init__()
        d, r = A.shape
        _, n = B.shape
        self.first = nn.Linear(n, r, bias=False)
        self.second = nn.Linear(r, d, bias=bias is not None)
        self.first.weight = nn.Parameter(B)
        self.second.weight = nn.Parameter(A)
        if bias is not None:
            self.second.bias = nn.Parameter(bias)

    def forward(self, x):
        return self.second(self.first(x.to(self.first.weight.dtype))).to(x.dtype)

# Replace merged layers with factored versions
for (layer_idx, sublayer_name), f in factors.items():
    layer = model.model.layers[layer_idx]
    parts = sublayer_name.split(".")
    parent = layer
    for p in parts[:-1]:
        parent = getattr(parent, p)
    original = getattr(parent, parts[-1])
    bias = original.bias.data if original.bias is not None else None
    compressed = CompressedLinear(f["A"], f["B"], bias=bias)
    setattr(parent, parts[-1], compressed)

# Each compressed sublayer now exposes .first.weight (B) and .second.weight (A)
# Parameters per compressed sublayer = A.numel() + B.numel(), ~20% fewer than the dense weight at r02
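
As a sanity check on the replacement above, the following sketch compares a dense linear map with weight A @ B against the two-step factored map on random data. It assumes the merged weight of a compressed sublayer equals A @ B, which the description implies but which is not verified here against the actual checkpoints; all shapes and values are illustrative.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, n, r = 6, 10, 3  # output dim, input dim, rank (toy values)
A = torch.randn(d, r, dtype=torch.float64)
B = torch.randn(r, n, dtype=torch.float64)
bias = torch.randn(d, dtype=torch.float64)
x = torch.randn(4, n, dtype=torch.float64)

# Dense path: a single (d, n) weight W = A @ B
dense_out = F.linear(x, A @ B, bias)

# Factored path: project down to rank r with B, then up to d with A
factored_out = F.linear(F.linear(x, B), A, bias)

max_diff = (dense_out - factored_out).abs().max().item()
print(f"max abs difference: {max_diff:.2e}")
print(f"dense params: {d * n}, factored params: {r * (d + n)}")
```

The two paths agree up to floating-point rounding, while the factored form stores r * (d + n) weights instead of d * n.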

Compression details

Method: SVD-LLM whitening + mixed calibration (4096 samples, seqlen=2048)
Compressed sublayers (7 per transformer block): q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Rank formula: r = (1 - ratio) * d * n / (d + n)
Stage B: Per-matrix whitened SVD truncation
Stage F: End-to-end LM-loss refinement of A, B factors
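
The rank formula above can be worked through for a square 5120 × 5120 projection at the 20% ratio. This is a sketch: `low_rank` is a hypothetical helper, and rounding down to an integer is an assumption, since the rounding rule is not stated here. Integer arithmetic on the percentage avoids floating-point rounding issues.

```python
def low_rank(d, n, ratio_pct):
    """r = (1 - ratio) * d * n / (d + n), rounded down (rounding is an assumption)."""
    return (100 - ratio_pct) * d * n // (100 * (d + n))

d, n = 5120, 5120
r = low_rank(d, n, 20)

# Fraction of the dense parameter count kept by the two factors
kept = r * (d + n) / (d * n)
print(f"rank = {r}, parameters kept = {kept:.0%}")
# prints: rank = 2048, parameters kept = 80%
```

The resulting rank of 2048 agrees with the (5120, 2048) shape quoted for the q_proj factor A in the loading example.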

License

Apache 2.0
