license: llama2 library_name: gptqmodel base_model: meta-llama/CodeLlama-34b-hf tags: - code - llama - gptq - quantized - 4-bit - 8-bit - multi-precision - eora - tevunahai model_type: llama pipeline_tag: text-generation inference: false quantized_by: TevunahAi

CodeLlama-34B-TevunahAi-GPTQ

Premium Ultra-Hybrid Multi-Precision GPTQ Quantization by TevunahAi

A professional-grade quantization of Meta's CodeLlama-34B using TevunahAi's Ultra-Hybrid methodologyβ€”strategically combining INT4, INT8, and FP8 precisions with EoRA (Error-corrected Low-Rank Adaptation) for maximum quality retention at significant compression.

Model Details

Property Value
Base Model meta-llama/CodeLlama-34b-hf
Parameters 34 Billion
Architecture Llama 2 (Code-specialized)
Context Length 16,384 tokens (base) / 100K (extended)
Quantization TevunahAi Ultra-Hybrid GPTQ
Compression 68GB β†’ 24.7GB (64%)
Expected Quality 98-99% retention vs FP16

Why This Quantization?

Most publicly available quantizations use uniform precision across all layers with minimal calibration (typically 128-256 samples). This leaves significant quality on the table.

TevunahAi Premium Approach:

  • 1,200 calibration samples β€” 8Γ— the industry standard, using diverse code and instruction datasets
  • Layer-aware precision β€” Critical layers (boundaries, attention) get INT8; less sensitive MLP layers get INT4
  • EoRA error correction β€” Low-rank adapters capture and correct quantization errors at every layer
  • Boundary protection β€” First and last layers use maximum EoRA rank (2048) for optimal error correction

Precision Strategy

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  LAYER 0 (Input Boundary)     β”‚  INT8 + EoRA-2048              β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Attention Layers 1-46        β”‚  INT8 + EoRA-128               β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  MLP Layers 1-38              β”‚  INT4 + EoRA-128               β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  MLP Layers 39-46             β”‚  INT8 + EoRA-128               β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  LAYER 47 (Output Boundary)   β”‚  INT8 + EoRA-2048              β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚  Embeddings & LM Head         β”‚  BF16 (preserved)              β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Rationale:

  • Boundary layers (0 & 47): Handle input/output transformationsβ€”errors here propagate everywhere. Maximum protection with INT8 + EoRA-2048.
  • Attention layers: Preserve semantic relationships and long-range dependencies. INT8 maintains precision.
  • Early/mid MLP layers (1-38): More redundant, tolerate INT4 with EoRA recovery.
  • Late MLP layers (39-46): Closer to output, use INT8 to preserve final representations.

Architecture

CodeLlama-34B is built on Llama 2 architecture, specialized for code generation:

Component Specification
Layers 48 transformer decoder blocks
Hidden Size 8,192
Intermediate Size 22,016
Attention Heads 64 (GQA with 8 KV heads)
Vocab Size 32,016
Activation SiLU (Swish)
Normalization RMSNorm
Position Encoding RoPE (ΞΈ=1,000,000)

Key Features:

  • Grouped Query Attention (GQA) for efficient inference
  • Extended context via modified RoPE theta
  • Code-specialized tokenizer and training

Calibration

Quality calibration is where most quantizations fail. We use a diverse mix optimized for code and instruction-following:

Dataset Samples Purpose
m-a-p/Code-Feedback 200 Code Q&A, debugging
nickrosh/Evol-Instruct-Code-80k 200 Complex code synthesis
HuggingFaceH4/ultrachat_200k 200 General instruction following
Open-Orca/SlimOrca 200 Reasoning and explanation

Total: 1,200 samples at 2,048 token context

This diversity ensures the quantization captures the full activation distribution across CodeLlama's capabilities.

Usage

With GPTQModel (Recommended)

from gptqmodel import GPTQModel
from transformers import AutoTokenizer

model_id = "TevunahAi/CodeLlama-34B-TevunahAi-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = GPTQModel.from_quantized(
    model_id,
    device_map="auto",
    trust_remote_code=True
)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

With Transformers + AutoGPTQ

from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "TevunahAi/CodeLlama-34B-TevunahAi-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    trust_remote_code=True
)

# Same generation code as above

Code Infilling

CodeLlama supports fill-in-the-middle (FIM):

prefix = "def remove_duplicates(lst):\n    "
suffix = "\n    return result"
prompt = f"<PRE> {prefix} <SUF>{suffix} <MID>"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=100)

Hardware Requirements

Configuration VRAM Required Notes
Full model (single GPU) ~25GB RTX 5000 Ada, A100-40GB, RTX 4090
With device_map="auto" ~20GB+ Multi-GPU or CPU offload
CPU offload 16GB VRAM + 32GB RAM Slower but functional

Tested on RTX 5000 Ada: 24.53GB allocated, 24.82GB reserved

Benchmarks

Test Hardware: NVIDIA RTX 5000 Ada (32GB VRAM)

LM Evaluation Harness (Quick Sanity Test)

Benchmark Metric Score
ARC-Challenge acc_norm 42.00%
HellaSwag acc_norm 67.00%
TruthfulQA MC2 acc 39.07%
Winogrande acc 73.00%

Code Generation Tests

Test Result Notes
Basic Code Generation βœ… PASS Correct recursive factorial
Code Completion βœ… PASS 5/5 binary search elements
Bug Fixing βœ… PASS Identified and fixed bugs
Code Explanation βœ… PASS Explained quicksort logic
JavaScript Generation βœ… PASS Working longestWord function
C++ Generation ❌ FAIL Repetitive output (base model behavior)
Data Structure (Stack) βœ… PASS 7/7 implementation elements
Code Refactoring βœ… PASS 5/5 refactoring patterns
Docstring Generation βœ… PASS Added comprehensive docstrings

Code Tests: 8/9 passed

Performance

Metric Value
Inference Speed ~18.2-18.7 tok/s
GPU Memory (Allocated) 24.53 GB
GPU Memory (Reserved) 24.82 GB
CPU RAM ~1.9 GB
Model Load Time ~95 seconds

The quantized model maintains strong performance across reasoning and code generation tasks while achieving 64% compression.

What is EoRA?

Error-corrected Low-Rank Adaptation (EoRA) is a technique developed by NVIDIA that captures quantization errors in low-rank matrices applied during inference. Unlike standard quantization that simply rounds weights and accepts the error, EoRA:

  1. Quantizes the weight matrix W β†’ W_q
  2. Computes error: E = W - W_q
  3. Decomposes E β‰ˆ A Γ— B (low-rank approximation)
  4. At inference: output = W_q(x) + A(B(x))

This recovers much of the lost precision, especially important for sensitive layers like attention and model boundaries.

Reference: EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation (NVIDIA, 2024)

Files Included

CodeLlama-34B-TevunahAi-GPTQ/
β”œβ”€β”€ config.json
β”œβ”€β”€ generation_config.json
β”œβ”€β”€ model.safetensors        # Quantized weights
β”œβ”€β”€ quantize_config.json     # Quantization parameters
β”œβ”€β”€ special_tokens_map.json
β”œβ”€β”€ tokenizer.json
β”œβ”€β”€ tokenizer.model
β”œβ”€β”€ tokenizer_config.json
└── quantization_metadata.json  # Full provenance

Quantization Details

{
  "bits": 4,
  "group_size": 128,
  "desc_act": false,
  "sym": true,
  "damp_percent": 0.01,
  "dynamic": true,
  "eora_enabled": true,
  "eora_ranks": {
    "boundary": 2048,
    "attention": 128,
    "mlp": 128
  }
}

"Tevunah" (ΧͺΦ°ΦΌΧ‘Χ•ΦΌΧ ΦΈΧ”) β€” Hebrew for understanding, discernment, wisdom in application.

License

This model inherits the Llama 2 Community License. Usage must comply with Meta's acceptable use policy.

Citation

If you use this quantization, please cite:

@misc{tevunahai2025codellama34b,
  title={CodeLlama-34B-TevunahAi-GPTQ: Premium Multi-Precision Quantization},
  author={TevunahAi},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/TevunahAi/CodeLlama-34B-TevunahAi-GPTQ}
}

EoRA technique by NVIDIA:

@article{liu2024eora,
  title={EoRA: Training-free Compensation for Compressed LLM with Eigenspace Low-Rank Approximation},
  author={Liu, Shih-Yang and others},
  journal={arXiv preprint arXiv:2410.21271},
  year={2024}
}

Acknowledgments

  • Meta AI for CodeLlama
  • NVIDIA for the EoRA technique
  • GPTQModel team for the quantization framework
  • Hugging Face for model hosting and transformers

Downloads last month
8
Safetensors
Model size
46B params
Tensor type
BF16
Β·
I32
Β·
Inference Providers NEW
This model isn't deployed by any Inference Provider. πŸ™‹ Ask for provider support

Model tree for TevunahAi/CodeLlama-34B-TevunahAi-GPTQ

Quantized
(3)
this model

Collection including TevunahAi/CodeLlama-34B-TevunahAi-GPTQ

Paper for TevunahAi/CodeLlama-34B-TevunahAi-GPTQ