# 🤗 caca-1B-untrained

Modern Transformer Architecture • 1 Billion Parameters • Untrained

~1,000,000,000 parameters • 88 layers • 4,096 tokens context • 32,000 vocab

## ⚠️ IMPORTANT: Untrained Model

This model has not gone through any training. The weights are still randomly initialized, so any output it produces will be random and meaningless.
| ✅ Can | ❌ Cannot yet |
|---|---|
| Load the model architecture | Generate meaningful text |
| Test forward passes | Answer questions |
| Measure memory & speed | Reasoning & understanding |
| Start training | Production deployment |
| Fine-tuning experiments | Real-world applications |
## Description

caca-1B-untrained is part of the Caca project – an open-source LLM architecture that combines various state-of-the-art techniques. The model is designed with a focus on computational efficiency, scalability, and strong performance for Indonesian and English.

> "Caca is an open-source Indonesian LLM experiment built from scratch by one person. It isn't competing with anyone – I just want to explore what can be done with a limited budget, unlimited passion, and a collaborative mindset." – Lyon, Creator
## Model Specifications
| Parameter | Value |
|---|---|
| Total Parameters | ~1,000,000,000 |
| Hidden Size | 1,024 |
| Intermediate Size | 2,688 |
| Num Layers | 88 |
| Attention Heads | 8 |
| KV Heads (GQA) | 1 |
| Head Dimension | 128 |
| Max Context Length | 4,096 tokens |
| Vocab Size | 32,000 |
| RoPE Theta | 10,000 |
| Model Size (FP32) | ~4 GB |
| Model Size (FP16/BF16) | ~2 GB |
| Model Size (INT8) | ~1 GB |
| Model Size (INT4) | ~0.5 GB |
## Architecture Features

### Attention

- Flash Attention 2 – IO-aware algorithm, up to 3× faster than standard attention
- Grouped Query Attention (GQA) – 8 query heads : 1 KV head
  - Saves 87.5% KV-cache memory vs Multi-Head Attention
  - Inference speed close to Multi-Query Attention
- QK Normalization – RMSNorm on queries & keys for training stability
- RoPE – Rotary Position Embeddings (θ=10,000)
- xFormers Support – memory-efficient attention fallback
- PyTorch SDPA – native scaled dot-product attention
### Architecture

- RMSNorm – ~50% faster than LayerNorm, no mean subtraction
- SwiGLU Activation – gate projection + up projection × SiLU
- Residual Dropout – regularization on residual connections
- NaN/Inf Recovery – automatic detection of and recovery from numerical instability
- Gradient Monitoring – per-layer gradient-norm tracking & clipping
- KV Cache – dynamic cache for efficient autoregressive generation
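The RMSNorm bullet follows directly from its formula: each activation is divided by the root-mean-square of the vector (no mean subtraction, unlike LayerNorm) and scaled by a learned weight. A minimal pure-Python sketch of the math – the model itself would use a vectorized PyTorch implementation; `rms_norm` here is purely illustrative:

```python
import math

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: scale by the root-mean-square of the inputs;
    # no mean subtraction, which is why it is cheaper than LayerNorm.
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [w * v / rms for w, v in zip(weight, x)]

print(rms_norm([3.0, 4.0], [1.0, 1.0]))
```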
### Training Features

- Gradient Checkpointing – trades compute for memory savings
- Mixed Precision – FP16, BF16, and FP32 support
- Label Smoothing – configurable (default: 0.0)
- Token Dropout – optional token-level regularization
- Metrics Tracking – real-time loss, perplexity, gradient norms
### Advanced (Optional, Off by Default)

- Mixture of Experts (MoE) – sparse expert routing
- Mixture of Depths (MoD) – dynamic compute allocation
- Cross-Attention – encoder-decoder fusion
- Vision Encoder – ViT-based multimodal support
- Layer Scale – training stability for deep networks
- Stochastic Depth – random layer dropping
- LoRA – low-rank adaptation via PEFT
- Quantization – 4/8-bit via bitsandbytes
- μP (MuP) – Maximal Update Parametrization
## Memory Requirements

### Inference
| Precision | Model Size | KV Cache (4K ctx) | Total |
|---|---|---|---|
| FP32 | ~4.0 GB | ~0.2 GB | ~4.2 GB |
| FP16 / BF16 | ~2.0 GB | ~0.1 GB | ~2.1 GB |
| INT8 | ~1.0 GB | ~0.1 GB | ~1.1 GB |
| INT4 (NF4) | ~0.5 GB | ~0.1 GB | ~0.6 GB |
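The model-size column is just parameter count × bytes per parameter (decimal GB, weights only; the KV cache and activations come on top). A quick sanity check:

```python
PARAMS = 1_000_000_000  # ~1B parameters

def weights_gb(bytes_per_param):
    # Model weight size in decimal gigabytes
    return PARAMS * bytes_per_param / 1e9

for name, b in [("FP32", 4), ("FP16/BF16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{name}: ~{weights_gb(b):.1f} GB")
```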
### Training
| Configuration | Memory |
|---|---|
| FP32 + AdamW | ~16 GB |
| Mixed Precision (BF16) | ~8 GB |
| + Gradient Checkpointing | ~5 GB |
| + LoRA (rank=16) | ~3 GB |
## Installation

```bash
# Core (required)
pip install "torch>=2.0.0" "transformers>=4.35.0" accelerate safetensors

# Optional: maximum performance
pip install flash-attn --no-build-isolation  # Flash Attention 2
pip install xformers                         # xFormers attention
pip install bitsandbytes                     # 4/8-bit quantization
pip install peft                             # LoRA fine-tuning
```

Note: the version specifiers are quoted so the shell does not interpret `>=` as a redirection.
## Usage

### Basic Loading
```python
from transformers import AutoConfig, AutoModelForCausalLM
import torch

# Load the config
config = AutoConfig.from_pretrained(
    "Lyon28/caca-1B-untrained",
    trust_remote_code=True
)

# Load the model
model = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-1B-untrained",
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto"
)

print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")
# ⚠️ The model is untrained – output will be meaningless
```
### 4-bit Quantization
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True
)

model = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-1B-untrained",
    trust_remote_code=True,
    quantization_config=bnb_config,
    device_map="auto"
)
# Memory: ~0.5 GB
```
### Training Setup
```python
from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./caca-1B-untrained",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    learning_rate=2e-4,
    max_steps=10000,
    lr_scheduler_type="cosine",
    warmup_steps=500,
    fp16=True,
    gradient_checkpointing=True,
    logging_steps=10,
    save_steps=500,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
```
### LoRA Fine-tuning
```python
# Enable LoRA via the config
from transformers import AutoConfig, AutoModelForCausalLM
import torch

config = AutoConfig.from_pretrained("Lyon28/caca-1B-untrained", trust_remote_code=True)
config.use_lora = True
config.lora_rank = 16
config.lora_alpha = 32.0
config.lora_target_modules = ["q_proj", "v_proj"]

model = AutoModelForCausalLM.from_pretrained(
    "Lyon28/caca-1B-untrained",
    config=config,
    trust_remote_code=True,
    torch_dtype=torch.float16,
)
model = model.apply_lora()
model.print_trainable_parameters()
# trainable params: ~2M || all params: ~1B || trainable%: ~0.2%
```
### Chat Format
```python
# Built-in chat template
messages = [
    {"role": "system", "content": "You are caca, a helpful assistant."},
    {"role": "user", "content": "Explain machine learning."},
]

# Manual formatting
prompt = "System: You are caca, a helpful assistant.\nUser: Explain machine learning.\nAssistant:"
```
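The manual format can also be generated from a messages list. A small hypothetical helper – `build_prompt` is not part of the model's tokenizer, and the model's built-in chat template may render differently:

```python
ROLE_LABELS = {"system": "System", "user": "User", "assistant": "Assistant"}

def build_prompt(messages):
    # Render messages into the manual "System:/User:/Assistant:" format,
    # leaving a trailing "Assistant:" for the model to complete.
    lines = [f"{ROLE_LABELS[m['role']]}: {m['content']}" for m in messages]
    lines.append("Assistant:")
    return "\n".join(lines)

messages = [
    {"role": "system", "content": "You are caca, a helpful assistant."},
    {"role": "user", "content": "Explain machine learning."},
]
print(build_prompt(messages))
```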
## Architecture Details

```
CacaForCausalLM (~1B params)
│
├─ Embedding: 32,000 × 1,024 = 32,768,000 params
│
├─ Transformer Layers (88×)
│   ├─ RMSNorm (input)
│   ├─ CacaAttention (GQA)
│   │   ├─ Q: 8 heads × 128 dim → Linear(1024, 1024)
│   │   ├─ K: 1 head × 128 dim → Linear(1024, 128)
│   │   ├─ V: 1 head × 128 dim → Linear(1024, 128)
│   │   ├─ O: Linear(1024, 1024)
│   │   ├─ QK Norm (RMSNorm per head)
│   │   └─ RoPE (θ=10,000)
│   ├─ Residual + Dropout
│   ├─ RMSNorm (post-attention)
│   ├─ CacaMLP (SwiGLU)
│   │   ├─ Gate: Linear(1024, 2688)
│   │   ├─ Up: Linear(1024, 2688)
│   │   └─ Down: Linear(2688, 1024)
│   └─ Residual + Dropout
│
├─ Final RMSNorm
└─ LM Head: Linear(1024, 32000)

Parameter breakdown per layer:
  Attention:   1,024 × (1,024 + 128 + 128 + 1,024) = 2,359,296
  FFN:         1,024 × 2,688 × 3                   = 8,257,536
  Norms:       1,024 × 2                           = 2,048
  Total/layer: ~10,618,880

  88 layers × ~10.6M = ~934M
  + Embeddings:        ~33M
  + LM Head:           ~33M
  = ~1,000M total
```
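The per-layer breakdown can be checked with a few lines of arithmetic (no bias terms, embeddings and LM head untied, as in the config):

```python
hidden, heads, kv_heads, head_dim = 1024, 8, 1, 128
inter, layers, vocab = 2688, 88, 32000

attn = hidden * (heads * head_dim + 2 * kv_heads * head_dim + hidden)  # Q + K + V + O
ffn = hidden * inter * 3   # gate, up, and down projections
norms = hidden * 2         # two RMSNorms per layer
per_layer = attn + ffn + norms

total = layers * per_layer + 2 * vocab * hidden  # + embedding + untied LM head
print(per_layer, total)  # 10618880 999997440 (≈1.0B)
```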
### GQA vs MHA Comparison

```
Multi-Head Attention (MHA):
  Q: 8 heads, K: 8 heads, V: 8 heads
  KV cache: 8 × 128 × 2 = 2,048 values per token (per layer)

Grouped Query Attention (GQA) – caca-1B-untrained:
  Q: 8 heads, K: 1 head, V: 1 head
  KV cache: 1 × 128 × 2 = 256 values per token (per layer)

Saving: 87.5% less KV-cache memory
```
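The same comparison as arithmetic – K and V each store `kv_heads × head_dim` values per token per layer, so the saving depends only on the head ratio:

```python
head_dim = 128

def kv_values_per_token(kv_heads):
    # K and V each contribute kv_heads × head_dim values per token per layer
    return kv_heads * head_dim * 2

mha = kv_values_per_token(8)  # 8 KV heads (standard MHA)
gqa = kv_values_per_token(1)  # 1 KV head (this model's GQA)
saving = (1 - gqa / mha) * 100
print(mha, gqa, f"{saving:.1f}%")  # 2048 256 87.5%
```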
## Full Configuration

```json
{
  "model_type": "caca",
  "architectures": ["CacaForCausalLM"],
  "vocab_size": 32000,
  "hidden_size": 1024,
  "intermediate_size": 2688,
  "num_hidden_layers": 88,
  "num_attention_heads": 8,
  "num_key_value_heads": 1,
  "head_dim": 128,
  "max_position_embeddings": 4096,
  "rope_theta": 10000.0,
  "rms_norm_eps": 1e-6,
  "use_qk_norm": true,
  "use_flash_attn": true,
  "use_rotary_embeddings": true,
  "attention_dropout": 0.0,
  "hidden_dropout": 0.1,
  "residual_dropout": 0.1,
  "use_cache": true,
  "tie_word_embeddings": false
}
```
## Training Tips

```python
# Recommended hyperparameters for caca-1B-untrained
learning_rate = 2e-4          # Base LR
warmup_ratio = 0.05           # 5% warmup
lr_scheduler = "cosine"
weight_decay = 0.1
max_grad_norm = 1.0
batch_size_effective = 256    # batch × accum × gpus

# GPU guidance:
# A100 40GB → batch=2, accum=8,  fp16
# RTX 3090  → batch=1, accum=16, fp16 + grad_checkpoint
# RTX 4090  → batch=1, accum=16, bf16 + grad_checkpoint
```
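Given the target effective batch size, the needed gradient-accumulation steps follow from the per-device batch and GPU count. A small illustrative helper (`accum_steps` is not part of any library):

```python
def accum_steps(target_effective, per_device_batch, num_gpus=1):
    # Accumulation steps so that per_device_batch × num_gpus × steps
    # comes out close to target_effective (rounded down, at least 1)
    steps = target_effective // (per_device_batch * num_gpus)
    return max(1, steps)

print(accum_steps(256, per_device_batch=2, num_gpus=8))  # 16
```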
## Troubleshooting

**Out of Memory:**

```python
model.gradient_checkpointing_enable()  # roughly 40% less activation memory
# Also consider:
# - reducing the batch size
# - load_in_8bit=True or load_in_4bit=True
# - torch_dtype=torch.bfloat16
```

**NaN Loss:**

```python
# Use BF16 (more stable than FP16)
torch_dtype = torch.bfloat16
# Or reduce the learning rate by 10x
```

**Slow Training:**

```bash
# Make sure flash-attn is installed
pip install flash-attn --no-build-isolation
```

```python
# Compile the model (PyTorch 2.0+)
model = torch.compile(model)
```
## License & Citation

This model is released under the Apache License 2.0 – free for commercial and non-commercial use with attribution.

```bibtex
@misc{caca1b,
  author       = {Lyon},
  title        = {caca-1B-untrained: Modern Transformer with Grouped Query Attention},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Lyon28/caca-1B-untrained}},
  note         = {Untrained model with ~1B parameters}
}
```
## Acknowledgments

- Flash Attention (Tri Dao et al.) – IO-aware attention algorithm
- GQA (Ainslie et al., Google) – Grouped Query Attention
- LLaMA (Meta AI) – decoder-only architecture inspiration
- RoPE (Su et al.) – rotary position embeddings
- SwiGLU (Shazeer) – gated linear unit activation
- 🤗 Hugging Face – Transformers library & infrastructure