
🤖 caca-1B-untrained

Modern Transformer Architecture • 1 Billion Parameters • Untrained


~1,000,000,000 parameters • 88 layers • 4,096-token context • 32,000 vocab


⚠️ IMPORTANT: Untrained Model

This model has not gone through any training. The weights are still randomly initialized, so generated output will be meaningless and random.

✅ Can:
  • Load the model architecture
  • Run a forward pass
  • Measure memory & speed
  • Start training
  • Run fine-tuning experiments

❌ Cannot yet:
  • Generate meaningful text
  • Answer questions
  • Reasoning & understanding
  • Production deployment
  • Real-world applications

📋 Description

caca-1B-untrained is part of the Caca project, an open-source LLM architecture that combines a range of state-of-the-art techniques. The model is designed with a focus on computational efficiency, scalability, and strong performance for Indonesian and English.

"Caca adalah eksperimen open-source Indonesian LLM yang dibuat dari nol secara individual. Bukan kompetitor siapa-siapa, cuma pengen eksplorasi apa yang bisa dilakukan dengan budget terbatas, passion unlimited, dan mindset collaborative." โ€” Lyon, Creator


📊 Model Specifications

Parameter Value
Total Parameters ~1,000,000,000
Hidden Size 1,024
Intermediate Size 2,688
Num Layers 88
Attention Heads 8
KV Heads (GQA) 1
Head Dimension 128
Max Context Length 4,096 tokens
Vocab Size 32,000
RoPE Theta 10,000
Model Size (FP32) ~4 GB
Model Size (FP16/BF16) ~2 GB
Model Size (INT8) ~1 GB
Model Size (INT4) ~0.5 GB

🚀 Architecture Features

🎯 Attention

  • ⚡ Flash Attention 2 – IO-aware algorithm, 3x faster than standard attention
  • 🔑 Grouped Query Attention (GQA) – 8 query heads : 1 KV head (see the sketch after this list)
    • Saves 87.5% of KV-cache memory vs Multi-Head Attention
    • Inference speed close to Multi-Query Attention
  • ✨ QK Normalization – RMSNorm on queries & keys for training stability
  • 🔄 RoPE – Rotary Position Embeddings (θ=10,000)
  • 🎯 xFormers Support – memory-efficient attention fallback
  • ⚙️ PyTorch SDPA – native scaled dot-product attention
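
A minimal sketch of what the 8:1 grouping means in tensor terms, assuming a LLaMA-style repeat of the single KV head before attention (names and shapes are illustrative, not the actual CacaAttention code):

import torch
import torch.nn.functional as F

batch, seq_len, head_dim = 1, 16, 128
num_q_heads, num_kv_heads = 8, 1

q = torch.randn(batch, num_q_heads, seq_len, head_dim)
k = torch.randn(batch, num_kv_heads, seq_len, head_dim)
v = torch.randn(batch, num_kv_heads, seq_len, head_dim)

# Expand the single KV head so all 8 query heads can attend to it;
# only the un-expanded k/v need to live in the KV cache
groups = num_q_heads // num_kv_heads
k = k.repeat_interleave(groups, dim=1)
v = v.repeat_interleave(groups, dim=1)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 16, 128])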

๐Ÿ—๏ธ Arsitektur

  • 📏 RMSNorm – ~50% faster than LayerNorm, no mean subtraction
  • 🔥 SwiGLU Activation – SiLU(gate(x)) × up(x), followed by a down projection (see the sketch after this list)
  • 💧 Residual Dropout – regularization on the residual connections
  • 🛡️ NaN/Inf Recovery – automatic detection of and recovery from numerical instability
  • 📊 Gradient Monitoring – per-layer gradient-norm tracking & clipping
  • 🔄 KV Cache – dynamic cache for efficient autoregressive generation
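
A compact, illustrative re-implementation of the RMSNorm and SwiGLU blocks described above (not the actual Caca classes; sizes follow the spec table):

import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps
    def forward(self, x):
        # Scale by the reciprocal RMS; no mean subtraction as in LayerNorm
        rms = x.pow(2).mean(-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * (x * rms)

class SwiGLU(nn.Module):
    def __init__(self, hidden=1024, intermediate=2688):
        super().__init__()
        self.gate = nn.Linear(hidden, intermediate, bias=False)
        self.up   = nn.Linear(hidden, intermediate, bias=False)
        self.down = nn.Linear(intermediate, hidden, bias=False)
    def forward(self, x):
        # down( SiLU(gate(x)) * up(x) )
        return self.down(F.silu(self.gate(x)) * self.up(x))

x = torch.randn(2, 16, 1024)
print(SwiGLU()(RMSNorm(1024)(x)).shape)  # torch.Size([2, 16, 1024])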

🎓 Training Features

  • 💾 Gradient Checkpointing – saves memory by trading extra compute
  • 🎯 Mixed Precision – supports FP16, BF16, FP32
  • 📉 Label Smoothing – configurable (default: 0.0)
  • 🔀 Token Dropout – optional token-level regularization
  • 📈 Metrics Tracking – real-time loss, perplexity, and gradient norms (see the note after this list)
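
The perplexity reported by the metrics tracking is just the exponential of the cross-entropy loss; with a 32,000-token vocabulary, an untrained model should start near uniform:

import math

vocab_size = 32_000
initial_loss = math.log(vocab_size)  # ≈ 10.37 nats/token for randomly initialized weights
print(f"expected starting perplexity ≈ {math.exp(initial_loss):,.0f}")  # ≈ 32,000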

🔧 Advanced (Optional, Off by Default)

  • 🧠 Mixture of Experts (MoE) – sparse expert routing
  • 🔀 Mixture of Depths (MoD) – dynamic compute allocation
  • 🔗 Cross-Attention – encoder-decoder fusion
  • 👁️ Vision Encoder – ViT-based multimodal support
  • 📊 Layer Scale – training stability for deep networks
  • 🎲 Stochastic Depth – random layer dropping
  • 🔁 LoRA – low-rank adaptation via PEFT
  • 📦 Quantization – 4/8-bit via bitsandbytes
  • 🔢 μP (MuP) – Maximal Update Parametrization

💾 Memory Requirements

Inference

Precision Model Size KV Cache (4K ctx) Total
FP32 ~4.0 GB ~0.2 GB ~4.2 GB
FP16 / BF16 ~2.0 GB ~0.1 GB ~2.1 GB
INT8 ~1.0 GB ~0.1 GB ~1.1 GB
INT4 (NF4) ~0.5 GB ~0.1 GB ~0.6 GB
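
The model-size column follows directly from parameter count × bytes per parameter; a quick check (INT4 approximated as 0.5 bytes/param, quantization metadata ignored):

params = 1_000_000_000
for name, bytes_per_param in {"FP32": 4, "FP16/BF16": 2, "INT8": 1, "INT4": 0.5}.items():
    print(f"{name:10s} ~{params * bytes_per_param / 1e9:.1f} GB")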

Training

Configuration Memory
FP32 + AdamW ~16 GB
Mixed Precision (BF16) ~8 GB
+ Gradient Checkpointing ~5 GB
+ LoRA (rank=16) ~3 GB
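
The FP32 + AdamW row matches the usual rule of thumb of ~16 bytes per parameter (weights, gradients, and the two Adam moment buffers, all FP32), before activations:

params = 1_000_000_000
bytes_per_param = 4 + 4 + 4 + 4   # weights + gradients + Adam m + Adam v, all FP32
print(f"~{params * bytes_per_param / 1e9:.0f} GB for weights + optimizer state")  # ~16 GB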

📦 Installation

# Core (required)
pip install torch>=2.0.0 transformers>=4.35.0 accelerate safetensors

# Optional: maximum performance
pip install flash-attn --no-build-isolation  # Flash Attention 2
pip install xformers                          # xFormers attention
pip install bitsandbytes                      # 4/8-bit quantization
pip install peft                              # LoRA fine-tuning
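
To see which of the optional backends are actually usable in your environment, a quick check (illustrative only):

import importlib.util

for pkg in ("flash_attn", "xformers", "bitsandbytes", "peft"):
    status = "available" if importlib.util.find_spec(pkg) else "not installed"
    print(f"{pkg}: {status}")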

💻 Usage

Basic Loading

from transformers import AutoConfig, AutoModelForCausalLM
import torch

# Load config
config = AutoConfig.from_pretrained(
    "Lyon28/caca-1B-untrained",
    trust_remote_code=True
)

# Load model
model = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-1B-untrained",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto"
)

print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
# ⚠️ The model is untrained; output will not be meaningful
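
Because the weights are random, the most useful first step is a forward-pass smoke test. This continues from the snippet above, needs no tokenizer, and assumes the custom class returns standard CausalLM outputs:

# Random token IDs are enough to verify shapes; the output is meaningless by design
input_ids = torch.randint(0, config.vocab_size, (1, 32), device=model.device)
with torch.no_grad():
    logits = model(input_ids).logits
print(logits.shape)  # torch.Size([1, 32, 32000])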

4-bit Quantization

from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-1B-untrained",
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map="auto"
)
# Memory: ~0.5 GB
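
To confirm the footprint on your own machine, Transformers' get_memory_footprint() gives a quick readout:

print(f"{model.get_memory_footprint() / 1e9:.2f} GB")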

Training Setup

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./caca-1B-untrained",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
    max_steps=10000,
    lr_scheduler_type="cosine",
    warmup_steps=500,
    fp16=True,
    gradient_checkpointing=True,
    logging_steps=10,
    save_steps=500,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)

trainer.train()
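
train_dataset is left undefined above; a minimal sketch of how it could be built with the datasets library (the corpus, the assumption that the repo ships a tokenizer, and all names below are placeholders, not project defaults):

from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

# Placeholder tokenizer and corpus; swap in your own 32k-vocab tokenizer and data
tokenizer = AutoTokenizer.from_pretrained("Lyon28/caca-1B-untrained", trust_remote_code=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

raw = load_dataset("wikimedia/wikipedia", "20231101.id", split="train[:1%]")
train_dataset = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=4096),
    batched=True,
    remove_columns=raw.column_names,
)
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)  # pads and adds labels
# then pass train_dataset=train_dataset and data_collator=data_collator to the Trainer above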

LoRA Fine-tuning

# Enable LoRA via the config
import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("Lyon28/caca-1B-untrained", trust_remote_code=True)
config.use_lora = True
config.lora_rank = 16
config.lora_alpha = 32.0
config.lora_target_modules = ["q_proj", "v_proj"]

model = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-1B-untrained",
    config=config,
    trust_remote_code=True,
    torch_dtype=torch.float16,
)

model = model.apply_lora()
model.print_trainable_parameters()
# trainable params: ~2M || all params: ~1B || trainable%: ~0.2%
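
If you prefer the standard PEFT API over the built-in apply_lora(), an equivalent setup looks roughly like this (not verified against the custom model class, so treat it as a sketch):

from peft import LoraConfig, get_peft_model

peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, peft_config)
model.print_trainable_parameters()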

Chat Format

# Built-in chat template
messages = [
    {"role": "system", "content": "Kamu adalah caca yang membantu."},   # "You are caca, a helpful assistant."
    {"role": "user", "content": "Jelaskan tentang machine learning."},  # "Explain machine learning."
]

# Manual format
prompt = "System: Kamu adalah caca yang membantu.\nUser: Jelaskan tentang machine learning.\nAssistant:"
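
If the repo ships a tokenizer carrying the built-in chat template, the same prompt can be produced with apply_chat_template (an assumption here, since the tokenizer is not shown in this card):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Lyon28/caca-1B-untrained", trust_remote_code=True)
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)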

🔬 Architecture Details

CacaForCausalLM (~1B params)
│
├─ Embedding: 32,000 × 1,024 = 32,768,000 params
│
├─ Transformer Layers (88×)
│  ├─ RMSNorm (input)
│  ├─ CacaAttention (GQA)
│  │  ├─ Q: 8 heads × 128 dim → Linear(1024, 1024)
│  │  ├─ K: 1 head  × 128 dim → Linear(1024, 128)
│  │  ├─ V: 1 head  × 128 dim → Linear(1024, 128)
│  │  ├─ O: Linear(1024, 1024)
│  │  ├─ QK Norm (RMSNorm per head)
│  │  └─ RoPE (θ=10,000)
│  ├─ Residual + Dropout
│  ├─ RMSNorm (post-attention)
│  ├─ CacaMLP (SwiGLU)
│  │  ├─ Gate: Linear(1024, 2688)
│  │  ├─ Up:   Linear(1024, 2688)
│  │  └─ Down: Linear(2688, 1024)
│  └─ Residual + Dropout
│
├─ Final RMSNorm
└─ LM Head: Linear(1024, 32000)

═══════════════════════════════════════
Parameter breakdown per layer:
  Attention: 1,024×(1,024 + 128 + 128 + 1,024) = 2,359,296
  FFN:       1,024×2,688×3 = 8,257,536
  Norms:     1,024×2 = 2,048
  Total/layer: ~10,618,880

88 layers × ~10.6M = ~934M
+ Embeddings: ~33M
+ LM Head: ~33M
= ~1,000M total
═══════════════════════════════════════
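
The breakdown above can be reproduced in a few lines:

hidden, inter, vocab, layers, kv_dim = 1024, 2688, 32000, 88, 128

attn = hidden * (hidden + kv_dim + kv_dim + hidden)  # Q, K, V, O projections
ffn = hidden * inter * 3                             # gate, up, down
norms = hidden * 2                                   # input + post-attention RMSNorm
per_layer = attn + ffn + norms                       # 10,618,880

total = layers * per_layer + vocab * hidden * 2      # + embeddings + untied LM head
print(f"{total:,}")  # 999,997,440 ≈ 1.0B (final norm and QK-norm weights are negligible)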

GQA vs MHA Comparison

Multi-Head Attention (MHA):
  Q: 8 heads, K: 8 heads, V: 8 heads
  KV cache: 8 × 128 × 2 = 2,048 values per token per layer

Grouped Query Attention (GQA) – caca-1B-untrained:
  Q: 8 heads, K: 1 head, V: 1 head
  KV cache: 1 × 128 × 2 = 256 values per token per layer
  Saving: 87.5% ↓ KV-cache memory
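
In bytes, with 88 layers and FP16 cache entries, the per-token cost works out as follows (a rough estimate that ignores framework overhead):

head_dim, layers, fp16_bytes = 128, 88, 2

def kv_bytes_per_token(kv_heads):
    # K and V entries for one token, summed across all layers
    return kv_heads * head_dim * 2 * layers * fp16_bytes

mha, gqa = kv_bytes_per_token(8), kv_bytes_per_token(1)
print(f"MHA: {mha / 1024:.0f} KiB/token, GQA: {gqa / 1024:.0f} KiB/token")  # 352 vs 44
print(f"saving: {100 * (1 - gqa / mha):.1f}%")  # 87.5%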

⚙️ Full Configuration

{
  "model_type": "caca",
  "architectures": ["CacaForCausalLM"],
  "vocab_size": 32000,
  "hidden_size": 1024,
  "intermediate_size": 2688,
  "num_hidden_layers": 88,
  "num_attention_heads": 8,
  "num_key_value_heads": 1,
  "head_dim": 128,
  "max_position_embeddings": 4096,
  "rope_theta": 10000.0,
  "rms_norm_eps": 1e-6,
  "use_qk_norm": true,
  "use_flash_attn": true,
  "use_rotary_embeddings": true,
  "attention_dropout": 0.0,
  "hidden_dropout": 0.1,
  "residual_dropout": 0.1,
  "use_cache": true,
  "tie_word_embeddings": false
}

🛠️ Training Tips

# Recommended hyperparameters for caca-1B-untrained

learning_rate = 2e-4          # Base LR
warmup_ratio = 0.05           # 5% warmup
lr_scheduler = "cosine"
weight_decay = 0.1
max_grad_norm = 1.0
batch_size_effective = 256    # batch × accum × gpus

# GPU requirements:
# A100 40GB  → batch=2, accum=8, fp16
# RTX 3090   → batch=1, accum=16, fp16 + grad_checkpoint
# RTX 4090   → batch=1, accum=16, bf16 + grad_checkpoint
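
A quick sanity check of how these numbers fit together (the GPU count below is an illustrative assumption, not a recommendation):

per_device_batch, grad_accum, num_gpus = 1, 16, 16    # e.g. 16 GPUs (illustrative)
print(per_device_batch * grad_accum * num_gpus)       # 256 = batch_size_effective

max_steps, warmup_ratio = 10_000, 0.05
print(int(max_steps * warmup_ratio))                  # 500 warmup steps, as in the Trainer example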

🔧 Troubleshooting

Out of Memory:

model.gradient_checkpointing_enable()       # -40% memory
# + reduce batch size
# + load_in_8bit=True or load_in_4bit=True
# + torch_dtype=torch.bfloat16

NaN Loss:

# Use BF16 (more stable than FP16)
torch_dtype = torch.bfloat16
# Or reduce the learning rate by 10x

Slow Training:

# Make sure flash-attn is installed
pip install flash-attn --no-build-isolation
# Compile model (PyTorch 2.0+)
model = torch.compile(model)

📜 License & Citation

This model is released under the Apache License 2.0: free to use for commercial and non-commercial purposes, with attribution.

@misc{caca1b,
  author = {Lyon},
  title = {caca-1B-untrained: Modern Transformer with Grouped Query Attention},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Lyon28/caca-1B-untrained}},
  note = {Untrained model with ~1B parameters}
}

🙏 Acknowledgments

  • Flash Attention (Tri Dao et al.) – IO-aware attention algorithm
  • GQA (Ainslie et al., Google) – Grouped Query Attention
  • LLaMA (Meta AI) – decoder-only architecture inspiration
  • RoPE (Su et al.) – rotary position embeddings
  • SwiGLU (Shazeer) – gated linear unit activation
  • 🤗 Hugging Face – Transformers library & infrastructure

Made with ❤️ by @Lyon28

"Dari nol, untuk semua"
