HLC-compressed-models
Compressed LLM checkpoints from Hierarchical Low-Rank Compression for LLMs (NeurIPS 2026 submission).
Available Models
| Model | Ratio | Stage B (SVD) | Stage F (Fine-tuned) |
|---|---|---|---|
| LLaMA-7B | 20%-80% | llama7b/r{02,04,06,08}/B | llama7b/r{02,04,06,08}/F |
| Qwen3-14B | 20% | qwen3_14b/r02/B | qwen3_14b/r02/F |
Quick Start: Merged Format
Loads with the standard Hugging Face from_pretrained(). The merged checkpoint is the same size as the original model:
from transformers import AutoModelForCausalLM, AutoTokenizer
repo = "zhc12/HLC-compressed-models"
# Qwen3-14B at a 20% compression ratio (r02), fine-tuned (Stage F)
model = AutoModelForCausalLM.from_pretrained(repo, subfolder="qwen3_14b/r02/F")
tokenizer = AutoTokenizer.from_pretrained(repo, subfolder="qwen3_14b/r02/F")
Factored Format: Low-Rank A, B Matrices
Each F/ directory also contains factors.pt, which stores the low-rank factors
A (d × r) and B (r × n) for every compressed linear layer. At the 20% ratio this
file is ~20% smaller than the merged weights, and it preserves the compression structure.
Loading factors (for analysis or continued training)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from huggingface_hub import hf_hub_download
repo = "zhc12/HLC-compressed-models"
subfolder = "qwen3_14b/r02/F"
# Load the base merged model
model = AutoModelForCausalLM.from_pretrained(repo, subfolder=subfolder)
tokenizer = AutoTokenizer.from_pretrained(repo, subfolder=subfolder)
# Download and load factors
factors_path = hf_hub_download(repo, f"{subfolder}/factors.pt")
factors = torch.load(factors_path, map_location="cpu", weights_only=True)
# factors is a dict: {(layer_idx, sublayer_name): {"A": tensor, "B": tensor}}
# Example: factors[(0, "self_attn.q_proj")]["A"].shape = (5120, 2048)
print(f"Loaded {len(factors)} factor pairs")
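As a quick sanity check on the loaded factors, the sketch below totals the parameters held in the A and B factors and compares them with the dense d × n layers they replace. It is illustrative only: `summarize_factor_shapes` is a hypothetical helper, and the example B shape (2048, 5120) is an assumption inferred from the A shape quoted above and the (r, n) convention.

```python
def summarize_factor_shapes(shapes):
    """Total factored vs. dense parameter counts.

    shapes: {(layer_idx, sublayer_name): ((d, r), (r, n))}
    Returns (factored_params, dense_params, fraction_saved).
    """
    factored = 0
    dense = 0
    for (layer_idx, name), ((d, r_a), (r_b, n)) in shapes.items():
        if r_a != r_b:
            raise ValueError(f"rank mismatch in layer {layer_idx} {name}: {r_a} vs {r_b}")
        factored += r_a * (d + n)  # A has d*r entries, B has r*n entries
        dense += d * n             # the merged weight is d*n
    return factored, dense, 1.0 - factored / dense


# Assumed example: A is (5120, 2048) as quoted above, B taken to be (2048, 5120).
example_shapes = {(0, "self_attn.q_proj"): ((5120, 2048), (2048, 5120))}
factored_params, dense_params, fraction_saved = summarize_factor_shapes(example_shapes)
print(factored_params, dense_params, round(fraction_saved, 3))
```

With the real file, the input can be built from the loaded dict, for example `{k: (tuple(v["A"].shape), tuple(v["B"].shape)) for k, v in factors.items()}`.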
Restoring low-rank structure (no external dependencies)
import torch
import torch.nn as nn

class CompressedLinear(nn.Module):
    """Low-rank linear: y = A @ (B @ x) + bias, where A is (d, r) and B is (r, n)."""

    def __init__(self, A, B, bias=None):
        super().__init__()
        d, r = A.shape
        _, n = B.shape
        self.first = nn.Linear(n, r, bias=False)
        self.second = nn.Linear(r, d, bias=bias is not None)
        self.first.weight = nn.Parameter(B)
        self.second.weight = nn.Parameter(A)
        if bias is not None:
            self.second.bias = nn.Parameter(bias)

    def forward(self, x):
        return self.second(self.first(x.to(self.first.weight.dtype))).to(x.dtype)

# Replace merged layers with factored versions
for (layer_idx, sublayer_name), f in factors.items():
    layer = model.model.layers[layer_idx]
    parts = sublayer_name.split(".")
    parent = layer
    for p in parts[:-1]:
        parent = getattr(parent, p)
    original = getattr(parent, parts[-1])
    bias = original.bias.data if original.bias is not None else None
    compressed = CompressedLinear(f["A"], f["B"], bias=bias)
    setattr(parent, parts[-1], compressed)
# Now each compressed sublayer has .first.weight (B) and .second.weight (A)
# Total trainable params = sum of A and B sizes, ~20% fewer than original
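The replacement above relies on the two-step product matching a single dense layer whose weight is A @ B. The sketch below checks that identity on small random tensors; it does not touch the released checkpoints. The shapes, the seed, and the assumption that a merged weight equals A @ B exactly are illustrative choices, not statements about the published files.

```python
import torch

torch.manual_seed(0)

# Toy sizes (hypothetical): output dim d, input dim n, rank r.
d, n, r = 8, 6, 3
A = torch.randn(d, r, dtype=torch.float64)
B = torch.randn(r, n, dtype=torch.float64)
x = torch.randn(4, n, dtype=torch.float64)  # batch of 4 input vectors

# Dense path: one (d, n) weight, applied the way nn.Linear does (x @ W.T).
W = A @ B
y_merged = x @ W.T

# Factored path: project down with B, then back up with A.
y_factored = (x @ B.T) @ A.T

max_abs_diff = (y_merged - y_factored).abs().max().item()
print(f"max abs difference: {max_abs_diff:.2e}")
```

The two paths agree up to floating-point rounding. The factored path stores r * (d + n) weights instead of d * n.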
Compression details
- Method: SVD-LLM whitening + mixed calibration (4096 samples, seqlen=2048)
- Compressed sublayers (7 per transformer block): q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
- Rank formula: r = (1 - ratio) * d * n / (d + n)
- Stage B: Per-matrix whitened SVD truncation
- Stage F: End-to-end LM-loss refinement of A, B factors
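To make the rank formula concrete, here is a minimal sketch that evaluates it for a 5120 × 5120 projection at the 20% ratio. That shape is an assumption based on the q_proj example shown earlier. `hlc_rank` is a hypothetical helper, and flooring to an integer is an assumption, since the rounding rule is not stated here.

```python
def hlc_rank(d: int, n: int, ratio_percent: int) -> int:
    """Evaluate r = (1 - ratio) * d * n / (d + n) in exact integer arithmetic.

    ratio_percent is the compression ratio in percent (e.g. 20 for r02).
    The result is floored; the actual rounding rule is an assumption.
    """
    return ((100 - ratio_percent) * d * n) // (100 * (d + n))


d, n = 5120, 5120  # assumed q_proj shape, matching the A shape (5120, 2048) above
rank = hlc_rank(d, n, 20)
kept_fraction = rank * (d + n) / (d * n)  # factored params relative to dense params
print(rank, kept_fraction)
```

Because the factors hold r * (d + n) parameters, choosing r this way keeps about (1 - ratio) of the dense d * n parameters.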
License
Apache 2.0