---
license: llama2
library_name: gptqmodel
base_model: meta-llama/CodeLlama-34b-hf
tags:
  - code
  - llama
  - gptq
  - quantized
  - 4-bit
  - 8-bit
  - multi-precision
  - eora
  - tevunahai
model_type: llama
pipeline_tag: text-generation
inference: false
quantized_by: TevunahAi
---
CodeLlama-34B-TevunahAi-GPTQ
Premium Ultra-Hybrid Multi-Precision GPTQ Quantization by TevunahAi
A professional-grade quantization of Meta's CodeLlama-34B using TevunahAi's Ultra-Hybrid methodology: strategically combining INT4, INT8, and FP8 precisions with EoRA (Error-corrected Low-Rank Adaptation) for maximum quality retention at significant compression.
Model Details
| Property | Value |
|---|---|
| Base Model | meta-llama/CodeLlama-34b-hf |
| Parameters | 34 Billion |
| Architecture | Llama 2 (Code-specialized) |
| Context Length | 16,384 tokens (base) / 100K (extended) |
| Quantization | TevunahAi Ultra-Hybrid GPTQ |
| Compression | ~64% size reduction vs FP16 (~68 GB down to ~25 GB) |
| Expected Quality | 98-99% retention vs FP16 |
Why This Quantization?
Most publicly available quantizations use uniform precision across all layers with minimal calibration (typically 128-256 samples). This leaves significant quality on the table.
TevunahAi Premium Approach:
- 1,200 calibration samples: 8x the industry standard, using diverse code and instruction datasets
- Layer-aware precision: critical layers (boundaries, attention) get INT8; less sensitive MLP layers get INT4
- EoRA error correction: low-rank adapters capture and correct quantization errors at every layer
- Boundary protection: the first and last layers use the maximum EoRA rank (2048) for optimal error correction
Precision Strategy
| Layer group | Precision |
|---|---|
| Layer 0 (input boundary) | INT8 + EoRA-2048 |
| Attention layers 1-46 | INT8 + EoRA-128 |
| MLP layers 1-38 | INT4 + EoRA-128 |
| MLP layers 39-46 | INT8 + EoRA-128 |
| Layer 47 (output boundary) | INT8 + EoRA-2048 |
| Embeddings & LM head | BF16 (preserved) |
Rationale:
- Boundary layers (0 & 47): Handle input/output transformations; errors here propagate everywhere. Maximum protection with INT8 + EoRA-2048.
- Attention layers: Preserve semantic relationships and long-range dependencies. INT8 maintains precision.
- Early/mid MLP layers (1-38): More redundant, tolerate INT4 with EoRA recovery.
- Late MLP layers (39-46): Closer to output, use INT8 to preserve final representations.
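Expressed as code, the policy above is just a lookup from (layer index, module type) to a bit-width and EoRA rank. The sketch below is illustrative only; the function name and module labels are hypothetical and not the script used to produce this model:

```python
# Illustrative sketch of the layer-aware precision policy described above.
# The helper name and module labels are hypothetical; they only mirror the table.
from typing import Literal

NUM_LAYERS = 48  # CodeLlama-34B has 48 decoder blocks (layers 0-47)

def precision_for(layer_idx: int, module: Literal["attention", "mlp"]) -> dict:
    """Return the bit-width and EoRA rank used for a given decoder layer/module."""
    if layer_idx in (0, NUM_LAYERS - 1):      # boundary layers: maximum protection
        return {"bits": 8, "eora_rank": 2048}
    if module == "attention":                 # all interior attention layers
        return {"bits": 8, "eora_rank": 128}
    if layer_idx <= 38:                       # early/mid MLP layers
        return {"bits": 4, "eora_rank": 128}
    return {"bits": 8, "eora_rank": 128}      # late MLP layers (39-46)

print(precision_for(0, "attention"))  # {'bits': 8, 'eora_rank': 2048}
print(precision_for(20, "mlp"))       # {'bits': 4, 'eora_rank': 128}
```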
Architecture
CodeLlama-34B is built on Llama 2 architecture, specialized for code generation:
| Component | Specification |
|---|---|
| Layers | 48 transformer decoder blocks |
| Hidden Size | 8,192 |
| Intermediate Size | 22,016 |
| Attention Heads | 64 (GQA with 8 KV heads) |
| Vocab Size | 32,016 |
| Activation | SiLU (Swish) |
| Normalization | RMSNorm |
| Position Encoding | RoPE (θ = 1,000,000) |
Key Features:
- Grouped Query Attention (GQA) for efficient inference
- Extended context via modified RoPE theta
- Code-specialized tokenizer and training
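These numbers can be checked against the config shipped with this repo (the quantized model keeps the base architecture fields). A quick sanity check, assuming a recent transformers version:

```python
from transformers import AutoConfig

# Inspect the architecture fields in the quantized repo's config.json
config = AutoConfig.from_pretrained("TevunahAi/CodeLlama-34B-TevunahAi-GPTQ")

print(config.num_hidden_layers)    # 48
print(config.hidden_size)          # 8192
print(config.intermediate_size)    # 22016
print(config.num_attention_heads)  # 64
print(config.num_key_value_heads)  # 8  (GQA)
print(config.rope_theta)           # 1000000.0
```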
Calibration
Quality calibration is where most quantizations fail. We use a diverse mix optimized for code and instruction-following:
| Dataset | Samples | Purpose |
|---|---|---|
| m-a-p/Code-Feedback | 200 | Code Q&A, debugging |
| nickrosh/Evol-Instruct-Code-80k | 200 | Complex code synthesis |
| HuggingFaceH4/ultrachat_200k | 200 | General instruction following |
| Open-Orca/SlimOrca | 200 | Reasoning and explanation |
Total: 1,200 samples at 2,048 token context
This diversity ensures the quantization captures the full activation distribution across CodeLlama's capabilities.
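For reference, here is a hedged sketch of how such a mixed calibration set could be assembled. The split names and the way text fields are flattened are assumptions for illustration, not a record of the actual TevunahAi pipeline:

```python
# Hedged sketch: assemble a mixed calibration set like the one listed above.
# Split names and text fields vary per dataset and are assumptions here.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("TevunahAi/CodeLlama-34B-TevunahAi-GPTQ")

SOURCES = [  # (dataset, split, samples) -- splits are assumptions
    ("m-a-p/Code-Feedback", "train", 200),
    ("nickrosh/Evol-Instruct-Code-80k", "train", 200),
    ("HuggingFaceH4/ultrachat_200k", "train_sft", 200),
    ("Open-Orca/SlimOrca", "train", 200),
]

calibration = []
for name, split, n in SOURCES:
    ds = load_dataset(name, split=split, streaming=True)
    for i, row in enumerate(ds):
        if i >= n:
            break
        # Flatten whatever text fields the row exposes into one string.
        text = " ".join(str(v) for v in row.values())
        enc = tokenizer(text, truncation=True, max_length=2048)
        calibration.append(enc["input_ids"])

print(f"{len(calibration)} calibration samples")
```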
Usage
With GPTQModel (Recommended)
```python
from gptqmodel import GPTQModel
from transformers import AutoTokenizer

model_id = "TevunahAi/CodeLlama-34B-TevunahAi-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = GPTQModel.from_quantized(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
With Transformers + AutoGPTQ
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "TevunahAi/CodeLlama-34B-TevunahAi-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True,
)

# Same generation code as above
```
Code Infilling
CodeLlama's fill-in-the-middle (FIM) prompt format is shown below (note that Meta trained infilling primarily for the 7B and 13B variants, so infilling quality with the 34B base may be limited):
```python
prefix = "def remove_duplicates(lst):\n    "
suffix = "\n    return result"

prompt = f"<PRE> {prefix} <SUF>{suffix} <MID>"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)
```
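Continuing from the snippet above, the generated middle can be decoded and spliced back between the prefix and suffix (a minimal sketch):

```python
# Decode only the newly generated tokens (the middle section),
# then splice them back between the prefix and suffix.
prompt_len = inputs["input_ids"].shape[1]
middle = tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True)

completed = prefix + middle + suffix
print(completed)
```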
Hardware Requirements
| Configuration | VRAM Required | Notes |
|---|---|---|
| Full model (single GPU) | ~25GB | RTX 5000 Ada (32GB), A100-40GB, or other 32GB+ GPUs |
| With device_map="auto" | ~20GB+ | Multi-GPU or CPU offload |
| CPU offload | 16GB VRAM + 32GB RAM | Slower but functional |
Tested on RTX 5000 Ada: 24.53GB allocated, 24.82GB reserved
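For the offload configurations above, transformers/accelerate accept per-device memory caps. The limits below are illustrative placeholders to tune for your hardware, not tested values:

```python
from transformers import AutoModelForCausalLM

# Cap GPU usage and spill the remainder to CPU RAM.
# The 16GiB / 32GiB limits are placeholders, not benchmarked settings.
model = AutoModelForCausalLM.from_pretrained(
    "TevunahAi/CodeLlama-34B-TevunahAi-GPTQ",
    device_map="auto",
    max_memory={0: "16GiB", "cpu": "32GiB"},
    trust_remote_code=True,
)
```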
Benchmarks
Test Hardware: NVIDIA RTX 5000 Ada (32GB VRAM)
LM Evaluation Harness (Quick Sanity Test)
| Benchmark | Metric | Score |
|---|---|---|
| ARC-Challenge | acc_norm | 42.00% |
| HellaSwag | acc_norm | 67.00% |
| TruthfulQA MC2 | acc | 39.07% |
| Winogrande | acc | 73.00% |
Code Generation Tests
| Test | Result | Notes |
|---|---|---|
| Basic Code Generation | ✅ PASS | Correct recursive factorial |
| Code Completion | ✅ PASS | 5/5 binary search elements |
| Bug Fixing | ✅ PASS | Identified and fixed bugs |
| Code Explanation | ✅ PASS | Explained quicksort logic |
| JavaScript Generation | ✅ PASS | Working longestWord function |
| C++ Generation | ❌ FAIL | Repetitive output (base model behavior) |
| Data Structure (Stack) | ✅ PASS | 7/7 implementation elements |
| Code Refactoring | ✅ PASS | 5/5 refactoring patterns |
| Docstring Generation | ✅ PASS | Added comprehensive docstrings |
Code Tests: 8/9 passed
Performance
| Metric | Value |
|---|---|
| Inference Speed | ~18.2-18.7 tok/s |
| GPU Memory (Allocated) | 24.53 GB |
| GPU Memory (Reserved) | 24.82 GB |
| CPU RAM | ~1.9 GB |
| Model Load Time | ~95 seconds |
The quantized model maintains strong performance across reasoning and code generation tasks while achieving roughly 64% compression (34B parameters at FP16 is ~68 GB; the quantized model loads in ~25 GB).
What is EoRA?
Error-corrected Low-Rank Adaptation (EoRA) is a technique developed by NVIDIA that captures quantization errors in low-rank matrices applied during inference. Unlike standard quantization that simply rounds weights and accepts the error, EoRA:
- Quantizes the weight matrix: W → W_q
- Computes the error: E = W - W_q
- Decomposes the error: E ≈ A × B (low-rank approximation)
- At inference: output = W_q(x) + A(B(x))
This recovers much of the lost precision, especially important for sensitive layers like attention and model boundaries.
Reference: EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation (NVIDIA, 2024)
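As a toy illustration of the principle (naive symmetric round-to-nearest quantization plus a plain SVD of the weight error; the EoRA paper and this model's pipeline use an activation-aware eigenspace projection instead):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((256, 256)).astype(np.float32)

# 1) Naive symmetric 4-bit round-to-nearest quantization (stand-in for GPTQ)
scale = np.abs(W).max() / 7.0                  # map to the symmetric int4 grid [-7, 7]
W_q = np.clip(np.round(W / scale), -7, 7) * scale

# 2) Quantization error and its rank-r approximation via SVD
E = W - W_q
r = 32
U, S, Vt = np.linalg.svd(E, full_matrices=False)
A = U[:, :r] * S[:r]                           # (out, r)
B = Vt[:r, :]                                  # (r, in)

# 3) Corrected forward pass: W_q @ x + A @ (B @ x)
x = rng.standard_normal((256,)).astype(np.float32)
y_ref = W @ x
y_q = W_q @ x
y_eora = W_q @ x + A @ (B @ x)

print("error without correction:", np.linalg.norm(y_ref - y_q))
print("error with rank-32 correction:", np.linalg.norm(y_ref - y_eora))
```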
Files Included
```
CodeLlama-34B-TevunahAi-GPTQ/
├── config.json
├── generation_config.json
├── model.safetensors             # Quantized weights
├── quantize_config.json          # Quantization parameters
├── special_tokens_map.json
├── tokenizer.json
├── tokenizer.model
├── tokenizer_config.json
└── quantization_metadata.json    # Full provenance
```
Quantization Details
```json
{
  "bits": 4,
  "group_size": 128,
  "desc_act": false,
  "sym": true,
  "damp_percent": 0.01,
  "dynamic": true,
  "eora_enabled": true,
  "eora_ranks": {
    "boundary": 2048,
    "attention": 128,
    "mlp": 128
  }
}
```
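To inspect the shipped quantization parameters and provenance metadata yourself, a small sketch using huggingface_hub:

```python
import json
from huggingface_hub import hf_hub_download

repo_id = "TevunahAi/CodeLlama-34B-TevunahAi-GPTQ"

# Fetch and print the quantization parameters and provenance metadata.
for filename in ("quantize_config.json", "quantization_metadata.json"):
    path = hf_hub_download(repo_id=repo_id, filename=filename)
    with open(path) as f:
        print(filename, "->", json.dumps(json.load(f), indent=2)[:500])
```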
"Tevunah" (ΧͺΦ°ΦΌΧΧΦΌΧ ΦΈΧ) β Hebrew for understanding, discernment, wisdom in application.
License
This model inherits the Llama 2 Community License. Usage must comply with Meta's acceptable use policy.
Citation
If you use this quantization, please cite:
```bibtex
@misc{tevunahai2025codellama34b,
  title={CodeLlama-34B-TevunahAi-GPTQ: Premium Multi-Precision Quantization},
  author={TevunahAi},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/TevunahAi/CodeLlama-34B-TevunahAi-GPTQ}
}
```
EoRA technique by NVIDIA:
```bibtex
@article{liu2024eora,
  title={EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation},
  author={Liu, Shih-Yang and others},
  journal={arXiv preprint arXiv:2410.21271},
  year={2024}
}
```
Acknowledgments
- Meta AI for CodeLlama
- NVIDIA for the EoRA technique
- GPTQModel team for the quantization framework
- Hugging Face for model hosting and transformers