Kybalion-1B

Kybalion-1B is a 1B-parameter language model built on top of Llama 3.2 1B through a full Continued Pre-Training (CPT) β†’ Supervised Fine-Tuning (SFT) pipeline, trained entirely on a Google Colab A100 GPU.

Why "Kybalion"? The model was originally developed under the internal codename Prometheus-1B, but was renamed to Kybalion-1B before public release to avoid confusion with an existing model of the same name on HuggingFace. Kybalion refers to the ancient hermetic text symbolizing hidden knowledge β€” fitting for a model focused on education, mathematics, science, and code.


πŸ† Key Highlights

  • Beats Llama-3.2-1B-Instruct on HellaSwag (63.8% vs 61.1%) and ties on WinoGrande (62.4%)
  • 4.5Γ— GSM8K improvement over TinyLlama-1.1B (10.8% vs 2.4%) β€” math pretraining works
  • Outperforms TinyLlama-1.1B on all 6 benchmarks
  • Trained by a single undergraduate student on consumer cloud hardware

πŸ”¬ Key Contributions

  • Demonstrates that domain-balanced continued pretraining on curated multi-domain data (education, math, code, science) yields consistent improvements across commonsense reasoning benchmarks in 1B-scale models
  • Suggests that multi-step mathematical reasoning remains a fundamental bottleneck for 1B-scale models, even when combining math-focused pretraining (OpenWebMath) with instruction tuning (MetaMathQA)
  • Provides a fully reproducible, compute-efficient training recipe (CPT β†’ LoRA SFT) built and executed by a single undergraduate student in under one week, demonstrating that meaningful LLM research is achievable without institutional resources or large teams

πŸ“Š Benchmark Results

All scores measured with lm-evaluation-harness under identical conditions (same prompts, same few-shot settings, same hardware).

Benchmark | TinyLlama-1.1B | Llama-3.2-1B-Instruct | Kybalion-1B
MMLU | 25.0% | 46.1% | 32.0%
ARC-C | 37.2% | 41.5% | 37.6%
GSM8K | 2.4% | 33.5% | 10.8%
HellaSwag | 61.2% | 61.1% | 63.8% πŸ†
WinoGrande | 61.8% | 62.4% | 62.4% πŸ†
TruthfulQA | 37.4% | 43.3% | 40.0%

πŸ† = outperforms Llama-3.2-1B-Instruct All evaluations run with lm_eval.simple_evaluate(), bfloat16, batch_size=8, A100 GPU.


πŸ”§ Training Pipeline

Phase 1: Continued Pre-Training (CPT)

Further pretrained the full base weights of meta-llama/Llama-3.2-1B on ~3.5B tokens of curated multi-domain data.

Domain | Dataset | Ratio | Purpose
Education | FineWeb-Edu (score β‰₯ 3.0) | 35% | General knowledge & reasoning
Mathematics | OpenWebMath | 20% | Mathematical reasoning
Code | StarCoderData (Python) | 15% | Code generation
Textbook | Cosmopedia web_samples_v2 | 15% | Structured knowledge
Science | Cosmopedia stanford | 10% | Scientific reasoning
Story | Cosmopedia stories | 5% | Language fluency
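
One way to realize such a mixture is to stream each corpus and interleave them with the ratios above as sampling probabilities. A minimal sketch, assuming the public Hub dataset IDs, config names, and column names; Kybalion's actual preprocessing may differ.

# Sketch: assembling the CPT mixture with Hugging Face datasets.
# Dataset IDs, config names, and column names are assumptions about the
# public Hub versions of the corpora listed in the table above.
from datasets import load_dataset, interleave_datasets

edu = load_dataset("HuggingFaceFW/fineweb-edu", split="train", streaming=True)
edu = edu.filter(lambda ex: ex["score"] >= 3.0)  # educational-quality filter
math = load_dataset("open-web-math/open-web-math", split="train", streaming=True)
code = load_dataset("bigcode/starcoderdata", data_dir="python", split="train", streaming=True)
code = code.rename_column("content", "text")  # StarCoderData stores text as "content"
textbook = load_dataset("HuggingFaceTB/cosmopedia", "web_samples_v2", split="train", streaming=True)
science = load_dataset("HuggingFaceTB/cosmopedia", "stanford", split="train", streaming=True)
story = load_dataset("HuggingFaceTB/cosmopedia", "stories", split="train", streaming=True)

# Keep only the text column and sample according to the domain ratios above.
streams = [ds.select_columns(["text"]) for ds in (edu, math, code, textbook, science, story)]
mixture = interleave_datasets(
    streams,
    probabilities=[0.35, 0.20, 0.15, 0.15, 0.10, 0.05],
    seed=42,
    stopping_strategy="all_exhausted",
)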

Training config:

  • Hardware: Google Colab A100 80GB
  • Optimizer: AdamW, LR = 2e-5, Cosine decay, Warmup = 1000 steps
  • Precision: BF16
  • Effective batch size: 32 (4 Γ— 8 grad accum)
  • Sequence length: 2048 (packed)
  • Framework: HuggingFace transformers.Trainer (no Unsloth)
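
These settings map directly onto TrainingArguments. A minimal sketch, assuming the examples have already been tokenized and packed into 2048-token blocks (packing itself is not shown; packed_dataset is a hypothetical placeholder).

# Sketch: full-parameter CPT with transformers.Trainer, mirroring the settings above.
# `packed_dataset` is a hypothetical dataset of examples already tokenized and
# packed into 2048-token blocks from the domain mixture.
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B", torch_dtype=torch.bfloat16)

args = TrainingArguments(
    output_dir="kybalion-cpt",
    per_device_train_batch_size=4,
    gradient_accumulation_steps=8,  # 4 x 8 = effective batch size 32
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_steps=1000,
    bf16=True,
    logging_steps=50,
    save_strategy="steps",
    save_steps=1000,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=packed_dataset,  # hypothetical pre-packed dataset
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()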

Phase 2: Supervised Fine-Tuning (SFT)

Applied LoRA adapters to teach instruction following, then merged them into the base weights.

Dataset | Size | Purpose
OpenHermes 2.5 | 100K | General instruction following
MetaMathQA | 50K | Mathematical reasoning (GSM8K boost)
CodeAlpaca | 20K | Code generation

SFT config:

  • Method: LoRA (r=64, Ξ±=128, dropout=0.05)
  • Target modules: q/k/v/o/gate/up/down proj (all linear layers)
  • LR = 1e-4, Epochs = 3, Cosine decay
  • Merged with PeftModel.merge_and_unload() for standalone deployment
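
A minimal peft sketch of the adapter setup and the final merge, matching the hyperparameters above; the SFT data pipeline and chat formatting are omitted, and "kybalion-cpt" is a hypothetical path to the Phase 1 checkpoint.

# Sketch: LoRA SFT with peft, matching the adapter hyperparameters above.
# "kybalion-cpt" is a hypothetical local path to the Phase 1 checkpoint.
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("kybalion-cpt", torch_dtype=torch.bfloat16)

lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(base, lora_config)
peft_model.print_trainable_parameters()

# ... instruction tuning on OpenHermes / MetaMathQA / CodeAlpaca goes here (not shown) ...

# Merge the adapters into the base weights for standalone deployment.
merged = peft_model.merge_and_unload()
merged.save_pretrained("Kybalion-1B")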

πŸ’» Usage

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("devwoo/Kybalion-1B")
model = AutoModelForCausalLM.from_pretrained(
    "devwoo/Kybalion-1B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

def chat(user_message, system="You are a helpful and knowledgeable AI assistant."):
    prompt = (
        f"<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|>"
        f"<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user_message}<|eot_id|>"
        f"<|start_header_id|>assistant<|end_header_id|>\n\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=512,
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            eos_token_id=tokenizer.convert_tokens_to_ids("<|eot_id|>"),
        )
    return tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

print(chat("Explain the Pythagorean theorem and give an example."))
print(chat("Write a Python function to check if a number is prime."))

πŸ“¦ GGUF Version

A quantized GGUF q4_k_m version is available at devwoo/Kybalion-1B-GGUF for CPU/mobile inference with llama.cpp or Ollama.

# With llama.cpp
./llama-cli -m Kybalion-1B-q4_k_m.gguf -p "Explain quantum computing." -n 256
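
The same file can also be loaded from Python via the llama-cpp-python bindings; a minimal sketch, assuming the GGUF filename shown in the command above.

# Sketch: CPU inference on the GGUF file with llama-cpp-python (pip install llama-cpp-python).
from llama_cpp import Llama

llm = Llama(model_path="Kybalion-1B-q4_k_m.gguf", n_ctx=2048)
out = llm.create_completion("Explain quantum computing.", max_tokens=256, temperature=0.7)
print(out["choices"][0]["text"])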

⚠️ Limitations

  • 1B parameters β€” smaller than most production models; may struggle with complex multi-step reasoning
  • Not RLHF-aligned; may occasionally produce unhelpful or inconsistent responses
  • English-only training data
  • GSM8K score (10.8%) reflects room for improvement in math reasoning compared to larger models

πŸ“„ License

This model is derived from meta-llama/Llama-3.2-1B and follows the Llama 3.2 Community License. Training datasets are used under their respective open licenses.
