# hunmin-13B-A2B-v0.11-base

A 13B-total / ~2B-active-parameter Mixture-of-Experts (MoE) language model with a custom GptOssForCausalLM architecture.
## Model Architecture

| Parameter | Value |
|---|---|
| Architecture | GptOssForCausalLM (decoder-only transformer) |
| Total Parameters | ~13B |
| Active Parameters | ~2B (top-2 routing) |
| Hidden Size | 1536 |
| Num Layers | 42 |
| Attention Heads | 24 (GQA: 8 KV heads, head_dim=64) |
| Attention Type | Hybrid sliding-window (window=128) / full attention (alternating layers) |
| MoE | 40 routed experts + 1 shared expert, top_k=2 |
| Vocab Size | 200,001 |
| Max Position | 131,072 (YaRN RoPE, base 4096, factor 32) |
| Precision | bfloat16 |
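The routing scheme in the table (40 routed experts, top_k=2; the shared expert always fires and is not routed) can be sketched as follows. This is a toy illustration with made-up logits, not the model's actual router weights or expert implementations.

```python
import math

NUM_EXPERTS = 40   # routed experts, per the architecture table
TOP_K = 2          # experts activated per token

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def route(router_logits):
    """Pick the top-k experts for one token and renormalize their gates."""
    probs = softmax(router_logits)
    top = sorted(range(NUM_EXPERTS), key=lambda i: probs[i], reverse=True)[:TOP_K]
    gate_sum = sum(probs[i] for i in top)
    return [(i, probs[i] / gate_sum) for i in top]

# Toy logits: experts 3 and 17 score highest for this token.
logits = [0.0] * NUM_EXPERTS
logits[3], logits[17] = 2.0, 1.5
selected = route(logits)
print(selected)  # two (expert_index, gate) pairs; gates sum to 1
```

Because only 2 of 40 routed experts run per token (plus the shared expert), the active parameter count stays near ~2B even though the full model holds ~13B.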
## Tokenizer

| Item | Value |
|---|---|
| Type | SpmDirectTokenizer (custom HuggingFace wrapper around SentencePiece) |
| Vocab File | tokenizer.model (SentencePiece BPE, 200K vocab) |
| BOS Token | `<s>` (id: 200000) |
| EOS/PAD Token | `<\|endoftext\|>` (id: 1) |

Important: SpmDirectTokenizer does NOT pre-split on whitespace (unlike LlamaTokenizer). Text is passed directly to SentencePiece, preserving native BPE merges, including the `▁` space-prefix convention. The tokenizer code is in `training/spm_tokenizer.py`.
### Usage

```python
from spm_tokenizer import SpmDirectTokenizer

tokenizer = SpmDirectTokenizer(vocab_file="tokenizer.model", bos_token="<s>")
ids = tokenizer.encode("Hello world")
```

Note: Do NOT use AutoTokenizer with this model: it will silently load a HuggingFace fast tokenizer that pre-splits on whitespace, producing different tokenization.
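The difference matters because pre-splitting blocks BPE merges that span a space. The sketch below uses a made-up piece inventory and a greedy longest-match stand-in for BPE (not the real 200K vocab or the actual fast-tokenizer pipeline) purely to show the mechanism; `▁` is SentencePiece's space marker.

```python
# Made-up pieces; the multi-word piece stands in for a learned BPE merge.
PIECES = {"▁Hello▁world", "▁Hello", "▁world", "Hello", "world"}

def segment(s):
    """Greedy longest-match over PIECES (stand-in for BPE segmentation)."""
    out, i = [], 0
    while i < len(s):
        for j in range(len(s), i, -1):
            if s[i:j] in PIECES:
                out.append(s[i:j])
                i = j
                break
        else:  # no piece matched: emit a single raw character
            out.append(s[i])
            i += 1
    return out

# Direct encoding: spaces become ▁ in the whole string, so the
# cross-space merge survives.
direct = segment("▁" + "Hello world".replace(" ", "▁"))

# Pre-split encoding: each word is segmented on its own, so no piece
# can ever span the space boundary.
presplit = [p for w in "Hello world".split() for p in segment("▁" + w)]

print(direct, presplit)
```

Here `direct` yields the single merged piece while `presplit` yields two pieces; with the real vocab the two paths likewise diverge, which is why AutoTokenizer must be avoided.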
## Training History

### Phase 1: 4K Context Pre-training

| Item | Value |
|---|---|
| Context Length | 4,096 |
| Steps | 99,000 |
| Batch Size | 8 per device |
| Gradient Accumulation | 1 |
| Nodes | 5 nodes × 8 GPUs |
| Data | FineWeb parquet shards (streaming) |
| Consumed Shards | 0–3,167 (3,168 shards, ~31.68M samples) |
| Final Loss | ~1.67 |
### Phase 2: 16K Context Extension

| Item | Value |
|---|---|
| Context Length | 16,384 |
| Steps | 3,400 |
| Batch Size | 2 per device |
| Gradient Accumulation | 1 |
| Nodes | 5 nodes × 8 GPUs |
| Learning Rate | 5e-5 (cosine, warmup 100) |
| Consumed Shards | 3,168–3,194 (27 additional shards) |
| Note | This phase used AutoTokenizer (HF fast tokenizer with whitespace pre-split), NOT SpmDirectTokenizer |
### Phase 3: 16K Tokenizer Remap

| Item | Value |
|---|---|
| Context Length | 16,384 |
| Steps | 2,000 |
| Batch Size | 2 per device |
| Gradient Accumulation | 1 |
| Nodes | 5 nodes × 8 GPUs |
| Learning Rate | 5e-5 (cosine, warmup 100) |
| Tokenizer | SpmDirectTokenizer (intended, but AutoTokenizer was actually loaded; see note) |
| Consumed Shards | 3,195–3,210 (16 additional shards) |
| Final Loss | ~1.63 |
| Note | Due to fallback logic in train.py, AutoTokenizer loaded successfully before SpmDirectTokenizer was reached. train.py was subsequently fixed to prioritize SpmDirectTokenizer. |
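The fixed loading order described in the note can be sketched as below. The loader functions are stand-ins for illustration, not the actual train.py code: the point is simply that the custom tokenizer is attempted first and AutoTokenizer is reached only on failure.

```python
def load_tokenizer(path, spm_loader, auto_loader):
    """Prefer SpmDirectTokenizer; fall back only when it cannot be loaded."""
    try:
        return spm_loader(path)
    except (ImportError, FileNotFoundError, OSError):
        return auto_loader(path)

# Toy loaders exercising both branches:
def spm_ok(p):
    return "SpmDirectTokenizer"

def spm_fail(p):
    raise ImportError("spm_tokenizer not importable")

def auto(p):
    return "AutoTokenizer"

tok = load_tokenizer("tokenizer.model", spm_ok, auto)    # custom wins
tok2 = load_tokenizer("tokenizer.model", spm_fail, auto)  # fallback path
print(tok, tok2)
```

The Phase 3 bug was the reverse ordering: with AutoTokenizer tried first and succeeding, the custom wrapper was never reached.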
### Phase 4: 64K Context Extension (In Progress)

| Item | Value |
|---|---|
| Context Length | 65,536 |
| Steps | 2,000 (target) |
| Batch Size | 1 per device |
| Gradient Accumulation | 1 |
| Nodes | 5 nodes × 8 GPUs |
| Learning Rate | 5e-5 (cosine, warmup 100) |
| Tokenizer | SpmDirectTokenizer (confirmed via log: `Loaded SpmDirectTokenizer (vocab_size=200000)`) |
| Starting Shard | 3,211 (resume from Phase 3) |
| Save Steps | 500 |
| Speed | ~292 s/step |
| FSDP | Full shard, bf16 mixed precision, gradient checkpointing |
| Attention | SDPA |
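A quick arithmetic check ties the phases to the architecture table: YaRN scales the original 4,096-token RoPE window by a factor of 32, so every phase's context length must fit inside the resulting position budget.

```python
# YaRN context budget: original RoPE window scaled by the YaRN factor.
ORIGINAL_CONTEXT = 4096
YARN_FACTOR = 32
MAX_POSITION = ORIGINAL_CONTEXT * YARN_FACTOR  # 131,072, per the table

# Context length used in each training phase:
phase_contexts = {1: 4_096, 2: 16_384, 3: 16_384, 4: 65_536}
assert all(c <= MAX_POSITION for c in phase_contexts.values())
print(MAX_POSITION)
```

Even the in-progress 64K phase uses only half the 131,072-position budget, leaving headroom for a further extension step.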
## Data

- Dataset: FineWeb (parquet shards, streaming mode)
- Path: text_corpora_fineweb_full_run02
- Total Shards: 387,270
- Rows per Shard: 10,000
- Consumed Shards (as of Phase 3 end): 3,211 / 387,270
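The shard bookkeeping above is easy to sanity-check: `consumed_samples` in data_state.json should equal `consumed_shards * rows_per_shard`, and the consumed fraction shows how little of the corpus has been seen so far.

```python
# Sanity check on the shard/sample bookkeeping from the Data section.
ROWS_PER_SHARD = 10_000
TOTAL_SHARDS = 387_270
consumed_shards = 3_211  # as of Phase 3 end

consumed_samples = consumed_shards * ROWS_PER_SHARD
progress = consumed_shards / TOTAL_SHARDS
print(consumed_samples, f"{progress:.2%}")  # matches data_state.json; under 1%
```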
### Data State (for resuming)

The data_state.json files track which shards have been consumed. To resume training from where it left off:

```json
{
  "global_step": 2000,
  "consumed_samples": 32110000,
  "consumed_shards": 3211,
  "rows_per_shard": 10000,
  "seed": 42
}
```

Pass this via `--resume_data_state <path_to_data_state.json>` to avoid re-training on already consumed data.
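Because shards are consumed in a fixed order, the resume logic reduces to starting the shard iterator at index `consumed_shards`. The sketch below assumes a hypothetical shard-naming scheme for illustration; it is not data.py's actual code.

```python
import json

# The data_state.json payload shown above:
state = json.loads("""
{"global_step": 2000, "consumed_samples": 32110000,
 "consumed_shards": 3211, "rows_per_shard": 10000, "seed": 42}
""")

# Hypothetical shard names; the real directory layout may differ.
all_shards = [f"shard-{i:06d}.parquet" for i in range(387_270)]

# Resume = skip everything already consumed, in order.
remaining = all_shards[state["consumed_shards"]:]
print(remaining[0])  # first shard the resumed run will read
```

This is why passing a stale or mismatched data_state.json silently re-trains on already-seen shards: the skip count is the only thing standing between the run and the start of the corpus.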
## Training Code

All training code is in the training/ directory:

| File | Description |
|---|---|
| train.py | Main training script (HuggingFace Trainer + FSDP + streaming dataset) |
| data.py | Dataset loading, tokenization, packing, data state management |
| spm_tokenizer.py | SpmDirectTokenizer, a custom SPM wrapper without whitespace pre-split |
| moe_design_hunmin-fm-13B-A2B.json | Model architecture design spec |
| trainjob-64k-remap-5node.yaml | Kubeflow TrainJob YAML for 64K context training |
| data_state_16k_remap_2000.json | Data state after Phase 3 (consumed_shards=3211) |
## How to Resume Training

- Use the latest checkpoint as `--resume_from_checkpoint`.
- Use the corresponding data_state.json as `--resume_data_state` to skip consumed shards.
- Ensure spm_tokenizer.py is in the same directory as train.py.
- Key arguments:

```bash
torchrun --nnodes=5 --nproc_per_node=8 ... train.py \
  --model_config_json moe_design_hunmin-fm-13B-A2B.json \
  --tokenizer_name_or_path <model_dir_with_tokenizer.model> \
  --train_path <parquet_data_dir> \
  --streaming --packing \
  --resume_from_checkpoint <checkpoint_dir> \
  --resume_data_state <data_state.json> \
  --attn_implementation sdpa \
  --max_length <context_length> \
  --per_device_train_batch_size <bs> \
  --gradient_accumulation_steps 1 \
  --gradient_checkpointing \
  --fsdp \
  --save_only_model
```
## Environment

| Item | Value |
|---|---|
| Framework | PyTorch 2.9.1, Transformers 5.x, Accelerate |
| GPU | NVIDIA B200 (180GB), 8 per node |
| Nodes | 5 |
| Network | InfiniBand (Mellanox mlx5) |
| FSDP | Full shard, bf16 mixed precision |
| NCCL | IB enabled, GDR level 8 |
| Memory | PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:256 |