license: llama2 library_name: gptqmodel base_model: meta-llama/CodeLlama-34b-hf tags: - code - llama - gptq - quantized - 4-bit - 8-bit - multi-precision - eora - tevunahai model_type: llama pipeline_tag: text-generation inference: false quantized_by: TevunahAi

CodeLlama-34B-TevunahAi-GPTQ

Premium Ultra-Hybrid Multi-Precision GPTQ Quantization by TevunahAi

A professional-grade quantization of Meta's CodeLlama-34B using TevunahAi's Ultra-Hybrid methodology—strategically combining INT4, INT8, and FP8 precisions with EoRA (Error-corrected Low-Rank Adaptation) for maximum quality retention at significant compression.

Model Details

Property	Value
Base Model	meta-llama/CodeLlama-34b-hf
Parameters	34 Billion
Architecture	Llama 2 (Code-specialized)
Context Length	16,384 tokens (base) / 100K (extended)
Quantization	TevunahAi Ultra-Hybrid GPTQ
Compression	~~68GB → 24.7GB (~~64%)
Expected Quality	98-99% retention vs FP16

Why This Quantization?

Most publicly available quantizations use uniform precision across all layers with minimal calibration (typically 128-256 samples). This leaves significant quality on the table.

TevunahAi Premium Approach:

1,200 calibration samples — 8× the industry standard, using diverse code and instruction datasets
Layer-aware precision — Critical layers (boundaries, attention) get INT8; less sensitive MLP layers get INT4
EoRA error correction — Low-rank adapters capture and correct quantization errors at every layer
Boundary protection — First and last layers use maximum EoRA rank (2048) for optimal error correction

Precision Strategy

┌─────────────────────────────────────────────────────────────────┐
│  LAYER 0 (Input Boundary)     │  INT8 + EoRA-2048              │
├─────────────────────────────────────────────────────────────────┤
│  Attention Layers 1-46        │  INT8 + EoRA-128               │
├─────────────────────────────────────────────────────────────────┤
│  MLP Layers 1-38              │  INT4 + EoRA-128               │
├─────────────────────────────────────────────────────────────────┤
│  MLP Layers 39-46             │  INT8 + EoRA-128               │
├─────────────────────────────────────────────────────────────────┤
│  LAYER 47 (Output Boundary)   │  INT8 + EoRA-2048              │
├─────────────────────────────────────────────────────────────────┤
│  Embeddings & LM Head         │  BF16 (preserved)              │
└─────────────────────────────────────────────────────────────────┘

Rationale:

Boundary layers (0 & 47): Handle input/output transformations—errors here propagate everywhere. Maximum protection with INT8 + EoRA-2048.
Attention layers: Preserve semantic relationships and long-range dependencies. INT8 maintains precision.
Early/mid MLP layers (1-38): More redundant, tolerate INT4 with EoRA recovery.
Late MLP layers (39-46): Closer to output, use INT8 to preserve final representations.

Architecture

CodeLlama-34B is built on Llama 2 architecture, specialized for code generation:

Component	Specification
Layers	48 transformer decoder blocks
Hidden Size	8,192
Intermediate Size	22,016
Attention Heads	64 (GQA with 8 KV heads)
Vocab Size	32,016
Activation	SiLU (Swish)
Normalization	RMSNorm
Position Encoding	RoPE (θ=1,000,000)

Key Features:

Grouped Query Attention (GQA) for efficient inference
Extended context via modified RoPE theta
Code-specialized tokenizer and training

Calibration

Quality calibration is where most quantizations fail. We use a diverse mix optimized for code and instruction-following:

Dataset	Samples	Purpose
m-a-p/Code-Feedback	200	Code Q&A, debugging
nickrosh/Evol-Instruct-Code-80k	200	Complex code synthesis
HuggingFaceH4/ultrachat_200k	200	General instruction following
Open-Orca/SlimOrca	200	Reasoning and explanation

Total: 1,200 samples at 2,048 token context

This diversity ensures the quantization captures the full activation distribution across CodeLlama's capabilities.

Usage

With GPTQModel (Recommended)

from gptqmodel import GPTQModel
from transformers import AutoTokenizer

model_id = "TevunahAi/CodeLlama-34B-TevunahAi-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = GPTQModel.from_quantized(
    model_id,
    device_map="auto",
    trust_remote_code=True
)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

With Transformers + AutoGPTQ

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "TevunahAi/CodeLlama-34B-TevunahAi-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True
)

# Same generation code as above

Code Infilling

CodeLlama supports fill-in-the-middle (FIM):

prefix = "def remove_duplicates(lst):\n    "
suffix = "\n    return result"
prompt = f"<PRE> {prefix} <SUF>{suffix} <MID>"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)

Hardware Requirements

Configuration	VRAM Required	Notes
Full model (single GPU)	~25GB	RTX 5000 Ada, A100-40GB, RTX 4090
With device_map="auto"	~20GB+	Multi-GPU or CPU offload
CPU offload	16GB VRAM + 32GB RAM	Slower but functional

Tested on RTX 5000 Ada: 24.53GB allocated, 24.82GB reserved

Benchmarks

Test Hardware: NVIDIA RTX 5000 Ada (32GB VRAM)

LM Evaluation Harness (Quick Sanity Test)

Benchmark	Metric	Score
ARC-Challenge	acc_norm	42.00%
HellaSwag	acc_norm	67.00%
TruthfulQA MC2	acc	39.07%
Winogrande	acc	73.00%

Code Generation Tests

Test	Result	Notes
Basic Code Generation	✅ PASS	Correct recursive factorial
Code Completion	✅ PASS	5/5 binary search elements
Bug Fixing	✅ PASS	Identified and fixed bugs
Code Explanation	✅ PASS	Explained quicksort logic
JavaScript Generation	✅ PASS	Working longestWord function
C++ Generation	❌ FAIL	Repetitive output (base model behavior)
Data Structure (Stack)	✅ PASS	7/7 implementation elements
Code Refactoring	✅ PASS	5/5 refactoring patterns
Docstring Generation	✅ PASS	Added comprehensive docstrings

Code Tests: 8/9 passed

Performance

Metric	Value
Inference Speed	~18.2-18.7 tok/s
GPU Memory (Allocated)	24.53 GB
GPU Memory (Reserved)	24.82 GB
CPU RAM	~1.9 GB
Model Load Time	~95 seconds

The quantized model maintains strong performance across reasoning and code generation tasks while achieving 64% compression.

What is EoRA?

Error-corrected Low-Rank Adaptation (EoRA) is a technique developed by NVIDIA that captures quantization errors in low-rank matrices applied during inference. Unlike standard quantization that simply rounds weights and accepts the error, EoRA:

Quantizes the weight matrix W → W_q
Computes error: E = W - W_q
Decomposes E ≈ A × B (low-rank approximation)
At inference: output = W_q(x) + A(B(x))

This recovers much of the lost precision, especially important for sensitive layers like attention and model boundaries.

Reference: EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation (NVIDIA, 2024)

Files Included

CodeLlama-34B-TevunahAi-GPTQ/
├── config.json
├── generation_config.json
├── model.safetensors        # Quantized weights
├── quantize_config.json     # Quantization parameters
├── special_tokens_map.json
├── tokenizer.json
├── tokenizer.model
├── tokenizer_config.json
└── quantization_metadata.json  # Full provenance

Quantization Details

{
  "bits": 4,
  "group_size": 128,
  "desc_act": false,
  "sym": true,
  "damp_percent": 0.01,
  "dynamic": true,
  "eora_enabled": true,
  "eora_ranks": {
    "boundary": 2048,
    "attention": 128,
    "mlp": 128
  }
}

"Tevunah" (תְּבוּנָה) — Hebrew for understanding, discernment, wisdom in application.

License

This model inherits the Llama 2 Community License. Usage must comply with Meta's acceptable use policy.

Citation

If you use this quantization, please cite:

@misc{tevunahai2025codellama34b,
  title={CodeLlama-34B-TevunahAi-GPTQ: Premium Multi-Precision Quantization},
  author={TevunahAi},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/TevunahAi/CodeLlama-34B-TevunahAi-GPTQ}
}

EoRA technique by NVIDIA:

@article{liu2024eora,
  title={EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation},
  author={Liu, Shih-Yang and others},
  journal={arXiv preprint arXiv:2410.21271},
  year={2024}
}

Acknowledgments

Meta AI for CodeLlama
NVIDIA for the EoRA technique
GPTQModel team for the quantization framework
Hugging Face for model hosting and transformers

Downloads last month: 8

Safetensors

Model size

46B params

Tensor type

BF16

I32

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for TevunahAi/CodeLlama-34B-TevunahAi-GPTQ

Base model

meta-llama/CodeLlama-34b-hf

Quantized

(3)

this model

Collection including TevunahAi/CodeLlama-34B-TevunahAi-GPTQ

Ultra Quantization Hybrid Model.

Collection

These models are quantized in mixed precision that allows them to have a smaller footprint than fp8, but still high quality. • 13 items • Updated Jan 10 • 1

Paper for TevunahAi/CodeLlama-34B-TevunahAi-GPTQ

EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation

Paper • 2410.21271 • Published Oct 28, 2024 • 8