# hunmin-13B-A2B-v0.11-base

13B total / ~2B active parameter Mixture-of-Experts (MoE) language model with custom GptOssForCausalLM architecture.

## Model Architecture

| Parameter | Value |
|---|---|
| Architecture | GptOssForCausalLM (decoder-only transformer) |
| Total Parameters | ~13B |
| Active Parameters | ~2B (top-2 routing) |
| Hidden Size | 1536 |
| Num Layers | 42 |
| Attention Heads | 24 (GQA: 8 KV heads, head_dim=64) |
| Attention Type | Hybrid sliding (window=128) / full attention (alternating layers) |
| MoE | 40 routed experts + 1 shared expert, top_k=2 |
| Vocab Size | 200,001 |
| Max Position | 131,072 (YaRN RoPE, base 4,096, factor 32) |
| Precision | bfloat16 |
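
The ~2B active-parameter figure follows from top-2 routing: each token is processed by only 2 of the 40 routed experts (plus the always-on shared expert). A minimal sketch of top-k gating in plain Python (the function name and the softmax-after-top-k ordering are illustrative, not taken from the GptOss implementation):

```python
import math

def top2_route(router_logits, top_k=2):
    # Rank the routed experts by router score, keep the top 2,
    # then softmax over just those two logits so the gates sum to 1.
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    exps = [math.exp(router_logits[i]) for i in chosen]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(chosen, exps)]

# Toy 4-expert example: experts 1 and 3 win; gate weights sum to 1.
routing = top2_route([0.1, 2.0, -0.5, 1.0])
```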

## Tokenizer

| Item | Value |
|---|---|
| Type | SpmDirectTokenizer (custom HuggingFace wrapper around SentencePiece) |
| Vocab File | `tokenizer.model` (SentencePiece BPE, 200K vocab) |
| BOS Token | `<s>` (id: 200000) |
| EOS/PAD Token | `<\|endoftext\|>` (id: 1) |

**Important:** SpmDirectTokenizer does NOT pre-split on whitespace (unlike LlamaTokenizer). Text is passed directly to SentencePiece, preserving native BPE merges including the `▁` space prefix. The tokenizer code is in `training/spm_tokenizer.py`.

## Usage

```python
from spm_tokenizer import SpmDirectTokenizer

tokenizer = SpmDirectTokenizer(vocab_file="tokenizer.model", bos_token="<s>")
ids = tokenizer.encode("Hello world")
```

**Note:** Do NOT use AutoTokenizer with this model; it will silently load a HuggingFace fast tokenizer that pre-splits on whitespace, producing different tokenization.
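
The divergence is easy to see with a toy model of the two pipelines. This sketch only mimics the pre-split step; the real merge tables are not reproduced here:

```python
# Toy illustration (not the real tokenizers): a whitespace pre-splitter
# feeds BPE each word separately, so inter-word space information is
# discarded before merges run. SentencePiece instead rewrites spaces to
# the "▁" marker and encodes the whole string as one stream, so merges
# can span word boundaries.
def presplit_view(text):
    return text.split(" ")               # what a pre-splitting tokenizer sees

def spm_view(text):
    return "▁" + text.replace(" ", "▁")  # SentencePiece's internal form

presplit_view("Hello world")   # ['Hello', 'world']
spm_view("Hello world")        # '▁Hello▁world'
```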

## Training History

### Phase 1: 4K Context Pre-training

| Item | Value |
|---|---|
| Context Length | 4,096 |
| Steps | 99,000 |
| Batch Size | 8 per device |
| Gradient Accumulation | 1 |
| Nodes | 5 nodes × 8 GPUs |
| Data | FineWeb parquet shards (streaming) |
| Consumed Shards | 0–3,167 (3,168 shards, ~31.68M samples) |
| Final Loss | ~1.67 |

### Phase 2: 16K Context Extension

| Item | Value |
|---|---|
| Context Length | 16,384 |
| Steps | 3,400 |
| Batch Size | 2 per device |
| Gradient Accumulation | 1 |
| Nodes | 5 nodes × 8 GPUs |
| Learning Rate | 5e-5 (cosine, warmup 100) |
| Consumed Shards | 3,168–3,194 (27 additional shards) |
| Note | This phase used AutoTokenizer (HF fast tokenizer with whitespace pre-split), NOT SpmDirectTokenizer |

### Phase 3: 16K Tokenizer Remap

| Item | Value |
|---|---|
| Context Length | 16,384 |
| Steps | 2,000 |
| Batch Size | 2 per device |
| Gradient Accumulation | 1 |
| Nodes | 5 nodes × 8 GPUs |
| Learning Rate | 5e-5 (cosine, warmup 100) |
| Tokenizer | SpmDirectTokenizer (intended; AutoTokenizer was actually loaded, see Note) |
| Consumed Shards | 3,195–3,210 (16 additional shards) |
| Final Loss | ~1.63 |
| Note | Due to fallback logic in train.py, AutoTokenizer loaded successfully before SpmDirectTokenizer was reached; train.py was subsequently fixed to prioritize SpmDirectTokenizer |

### Phase 4: 64K Context Extension (In Progress)

| Item | Value |
|---|---|
| Context Length | 65,536 |
| Steps | 2,000 (target) |
| Batch Size | 1 per device |
| Gradient Accumulation | 1 |
| Nodes | 5 nodes × 8 GPUs |
| Learning Rate | 5e-5 (cosine, warmup 100) |
| Tokenizer | SpmDirectTokenizer (confirmed via log: `Loaded SpmDirectTokenizer (vocab_size=200000)`) |
| Starting Shard | 3,211 (resume from Phase 3) |
| Save Steps | 500 |
| Speed | ~292 s/step |
| FSDP | Full shard, bf16 mixed precision, gradient checkpointing |
| Attention | SDPA |
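
At these settings the global batch is small but each packed sequence is long. A back-of-envelope throughput check from the table's own numbers (packing assumed):

```python
def tokens_per_step(nodes=5, gpus_per_node=8, per_device_bs=1,
                    grad_accum=1, seq_len=65_536):
    # Global tokens processed per optimizer step for Phase 4's config.
    return nodes * gpus_per_node * per_device_bs * grad_accum * seq_len

step_tokens = tokens_per_step()       # 2,621,440 tokens per step
tok_per_sec = step_tokens / 292       # ~8,978 tokens/s at ~292 s/step
eta_days = 2_000 * 292 / 86_400       # ~6.8 days to the 2,000-step target
```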

## Data

- **Dataset:** FineWeb (parquet shards, streaming mode)
- **Path:** `text_corpora_fineweb_full_run02`
- **Total Shards:** 387,270
- **Rows per Shard:** 10,000
- **Consumed Shards (as of Phase 3 end):** 3,211 / 387,270
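
Shard accounting is a straight product, since every shard holds exactly 10,000 rows; the result matches the Phase 3 state (3,211 shards consumed, 32.11M samples):

```python
def consumed_samples(consumed_shards, rows_per_shard=10_000):
    # Every FineWeb shard in this run holds exactly 10,000 rows,
    # so the resume offset is a simple product.
    return consumed_shards * rows_per_shard

consumed_samples(3_211)   # 32,110,000 -- end of Phase 3
consumed_samples(3_168)   # 31,680,000 -- Phase 1 (~31.68M samples)
```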

### Data State (for resuming)

The `data_state.json` files track which shards have been consumed. To resume training from where it left off:

```json
{
  "global_step": 2000,
  "consumed_samples": 32110000,
  "consumed_shards": 3211,
  "rows_per_shard": 10000,
  "seed": 42
}
```

Pass this via `--resume_data_state <path_to_data_state.json>` to avoid re-training on already-consumed data.
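
A small helper can sanity-check a data_state file before resuming. The field names match the example above; the function itself is a sketch, not part of train.py:

```python
import json

def first_unconsumed_shard(state_path):
    # Load a data_state.json, verify its internal consistency, and
    # return the index of the first shard that has NOT been consumed
    # (shards 0 .. consumed_shards-1 are done).
    with open(state_path) as f:
        state = json.load(f)
    expected = state["consumed_shards"] * state["rows_per_shard"]
    if state["consumed_samples"] != expected:
        raise ValueError("consumed_samples does not match shard accounting")
    return state["consumed_shards"]
```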

## Training Code

All training code is in the `training/` directory:

| File | Description |
|---|---|
| `train.py` | Main training script (HuggingFace Trainer + FSDP + streaming dataset) |
| `data.py` | Dataset loading, tokenization, packing, data state management |
| `spm_tokenizer.py` | SpmDirectTokenizer, a custom SPM wrapper without whitespace pre-split |
| `moe_design_hunmin-fm-13B-A2B.json` | Model architecture design spec |
| `trainjob-64k-remap-5node.yaml` | Kubeflow TrainJob YAML for 64K context training |
| `data_state_16k_remap_2000.json` | Data state after Phase 3 (consumed_shards=3211) |

## How to Resume Training

1. Use the latest checkpoint as `--resume_from_checkpoint`
2. Use the corresponding `data_state.json` as `--resume_data_state` to skip consumed shards
3. Ensure `spm_tokenizer.py` is in the same directory as `train.py`
4. Key arguments:

   ```shell
   torchrun --nnodes=5 --nproc_per_node=8 ... train.py \
     --model_config_json moe_design_hunmin-fm-13B-A2B.json \
     --tokenizer_name_or_path <model_dir_with_tokenizer.model> \
     --train_path <parquet_data_dir> \
     --streaming --packing \
     --resume_from_checkpoint <checkpoint_dir> \
     --resume_data_state <data_state.json> \
     --attn_implementation sdpa \
     --max_length <context_length> \
     --per_device_train_batch_size <bs> \
     --gradient_accumulation_steps 1 \
     --gradient_checkpointing \
     --fsdp \
     --save_only_model
   ```

## Environment

| Item | Value |
|---|---|
| Framework | PyTorch 2.9.1, Transformers 5.x, Accelerate |
| GPU | NVIDIA B200 (180GB), 8 per node |
| Nodes | 5 |
| Network | InfiniBand (Mellanox mlx5) |
| FSDP | Full shard, bf16 mixed precision |
| NCCL | IB enabled, GDR level 8 |
| Memory | `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:256` |