# hunmin-13B-A2B-v0.11-base

A 13B-total / ~2B-active-parameter Mixture-of-Experts (MoE) language model with a custom GptOssForCausalLM architecture.
## Model Architecture

| Parameter | Value |
|---|---|
| Architecture | GptOssForCausalLM (decoder-only transformer) |
| Total Parameters | ~13B |
| Active Parameters | ~2B (top-2 routing) |
| Hidden Size | 1536 |
| Num Layers | 42 |
| Attention Heads | 24 (GQA: 8 KV heads, head_dim=64) |
| Attention Type | Hybrid sliding-window (window=128) / full attention (alternating layers) |
| MoE | 40 routed experts + 1 shared expert, top_k=2 |
| Vocab Size | 200,001 |
| Max Position | 131,072 (YaRN RoPE, base 4096, factor 32) |
| Precision | bfloat16 |
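The routing scheme in the table (40 routed experts, top_k=2; the shared expert always fires and is not routed) can be sketched as follows. This is a toy illustration with made-up logits, not the model's actual router weights or expert implementations.

```python
import math

NUM_EXPERTS = 40   # routed experts, per the architecture table
TOP_K = 2          # experts activated per token

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def route(router_logits):
    """Pick the top-k experts for one token and renormalize their gates."""
    probs = softmax(router_logits)
    top = sorted(range(NUM_EXPERTS), key=lambda i: probs[i], reverse=True)[:TOP_K]
    gate_sum = sum(probs[i] for i in top)
    return [(i, probs[i] / gate_sum) for i in top]

# Toy logits: experts 3 and 17 score highest for this token.
logits = [0.0] * NUM_EXPERTS
logits[3], logits[17] = 2.0, 1.5
selected = route(logits)
print(selected)  # two (expert_index, gate) pairs; gates sum to 1
```

Because only 2 of 40 routed experts run per token (plus the shared expert), the active parameter count stays near ~2B even though the full model holds ~13B.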
## Tokenizer

| Item | Value |
|---|---|
| Type | SpmDirectTokenizer (custom HuggingFace wrapper around SentencePiece) |
| Vocab File | tokenizer.model (SentencePiece BPE, 200K vocab) |
| BOS Token | `<s>` (id: 200000) |
| EOS/PAD Token | `<\|endoftext\|>` (id: 1) |

Important: SpmDirectTokenizer does NOT pre-split on whitespace (unlike LlamaTokenizer). Text is passed directly to SentencePiece, preserving native BPE merges, including the `▁` space-prefix convention. The tokenizer code is in `training/spm_tokenizer.py`.
### Usage

```python
from spm_tokenizer import SpmDirectTokenizer

tokenizer = SpmDirectTokenizer(vocab_file="tokenizer.model", bos_token="<s>")
ids = tokenizer.encode("Hello world")
```

Note: Do NOT use AutoTokenizer with this model: it will silently load a HuggingFace fast tokenizer that pre-splits on whitespace, producing different tokenization.
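The difference matters because pre-splitting blocks BPE merges that span a space. The sketch below uses a made-up piece inventory and a greedy longest-match stand-in for BPE (not the real 200K vocab or the actual fast-tokenizer pipeline) purely to show the mechanism; `▁` is SentencePiece's space marker.

```python
# Made-up pieces; the multi-word piece stands in for a learned BPE merge.
PIECES = {"▁Hello▁world", "▁Hello", "▁world", "Hello", "world"}

def segment(s):
    """Greedy longest-match over PIECES (stand-in for BPE segmentation)."""
    out, i = [], 0
    while i < len(s):
        for j in range(len(s), i, -1):
            if s[i:j] in PIECES:
                out.append(s[i:j])
                i = j
                break
        else:  # no piece matched: emit a single raw character
            out.append(s[i])
            i += 1
    return out

# Direct encoding: spaces become ▁ in the whole string, so the
# cross-space merge survives.
direct = segment("▁" + "Hello world".replace(" ", "▁"))

# Pre-split encoding: each word is segmented on its own, so no piece
# can ever span the space boundary.
presplit = [p for w in "Hello world".split() for p in segment("▁" + w)]

print(direct, presplit)
```

Here `direct` yields the single merged piece while `presplit` yields two pieces; with the real vocab the two paths likewise diverge, which is why AutoTokenizer must be avoided.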
## Training History

### Phase 1: 4K Context Pre-training

| Item | Value |
|---|---|
| Context Length | 4,096 |
| Steps | 99,000 |
| Batch Size | 8 per device |
| Gradient Accumulation | 1 |
| Nodes | 5 nodes × 8 GPUs |
| Data | FineWeb parquet shards (streaming) |
| Consumed Shards | 0–3,167 (3,168 shards, ~31.68M samples) |
| Final Loss | ~1.67 |
### Phase 2: 16K Context Extension

| Item | Value |
|---|---|
| Context Length | 16,384 |
| Steps | 3,400 |
| Batch Size | 2 per device |
| Gradient Accumulation | 1 |
| Nodes | 5 nodes × 8 GPUs |
| Learning Rate | 5e-5 (cosine, warmup 100) |
| Consumed Shards | 3,168–3,194 (27 additional shards) |
| Note | This phase used AutoTokenizer (HF fast tokenizer with whitespace pre-split), NOT SpmDirectTokenizer |
### Phase 3: 16K Tokenizer Remap

| Item | Value |
|---|---|
| Context Length | 16,384 |
| Steps | 2,000 |
| Batch Size | 2 per device |
| Gradient Accumulation | 1 |
| Nodes | 5 nodes × 8 GPUs |
| Learning Rate | 5e-5 (cosine, warmup 100) |
| Tokenizer | SpmDirectTokenizer (intended, but AutoTokenizer was actually loaded; see note) |
| Consumed Shards | 3,195–3,210 (16 additional shards) |
| Final Loss | ~1.63 |
| Note | Due to fallback logic in train.py, AutoTokenizer loaded successfully before SpmDirectTokenizer was reached. train.py was subsequently fixed to prioritize SpmDirectTokenizer. |
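The fixed loading order described in the note can be sketched as below. The loader functions are stand-ins for illustration, not the actual train.py code: the point is simply that the custom tokenizer is attempted first and AutoTokenizer is reached only on failure.

```python
def load_tokenizer(path, spm_loader, auto_loader):
    """Prefer SpmDirectTokenizer; fall back only when it cannot be loaded."""
    try:
        return spm_loader(path)
    except (ImportError, FileNotFoundError, OSError):
        return auto_loader(path)

# Toy loaders exercising both branches:
def spm_ok(p):
    return "SpmDirectTokenizer"

def spm_fail(p):
    raise ImportError("spm_tokenizer not importable")

def auto(p):
    return "AutoTokenizer"

tok = load_tokenizer("tokenizer.model", spm_ok, auto)    # custom wins
tok2 = load_tokenizer("tokenizer.model", spm_fail, auto)  # fallback path
print(tok, tok2)
```

The Phase 3 bug was the reverse ordering: with AutoTokenizer tried first and succeeding, the custom wrapper was never reached.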
### Phase 4: 64K Context Extension (In Progress)

| Item | Value |
|---|---|
| Context Length | 65,536 |
| Steps | 2,000 (target) |
| Batch Size | 1 per device |
| Gradient Accumulation | 1 |
| Nodes | 5 nodes × 8 GPUs |
| Learning Rate | 5e-5 (cosine, warmup 100) |
| Tokenizer | SpmDirectTokenizer (confirmed via log: `Loaded SpmDirectTokenizer (vocab_size=200000)`) |
| Starting Shard | 3,211 (resume from Phase 3) |
| Save Steps | 500 |
| Speed | ~292 s/step |
| FSDP | Full shard, bf16 mixed precision, gradient checkpointing |
| Attention | SDPA |
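A quick arithmetic check ties the phases to the architecture table: YaRN scales the original 4,096-token RoPE window by a factor of 32, so every phase's context length must fit inside the resulting position budget.

```python
# YaRN context budget: original RoPE window scaled by the YaRN factor.
ORIGINAL_CONTEXT = 4096
YARN_FACTOR = 32
MAX_POSITION = ORIGINAL_CONTEXT * YARN_FACTOR  # 131,072, per the table

# Context length used in each training phase:
phase_contexts = {1: 4_096, 2: 16_384, 3: 16_384, 4: 65_536}
assert all(c <= MAX_POSITION for c in phase_contexts.values())
print(MAX_POSITION)
```

Even the in-progress 64K phase uses only half the 131,072-position budget, leaving headroom for a further extension step.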
## Data

- Dataset: FineWeb (parquet shards, streaming mode)
- Path: text_corpora_fineweb_full_run02
- Total Shards: 387,270
- Rows per Shard: 10,000
- Consumed Shards (as of Phase 3 end): 3,211 / 387,270
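The shard bookkeeping above is easy to sanity-check: `consumed_samples` in data_state.json should equal `consumed_shards * rows_per_shard`, and the consumed fraction shows how little of the corpus has been seen so far.

```python
# Sanity check on the shard/sample bookkeeping from the Data section.
ROWS_PER_SHARD = 10_000
TOTAL_SHARDS = 387_270
consumed_shards = 3_211  # as of Phase 3 end

consumed_samples = consumed_shards * ROWS_PER_SHARD
progress = consumed_shards / TOTAL_SHARDS
print(consumed_samples, f"{progress:.2%}")  # matches data_state.json; under 1%
```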
### Data State (for resuming)

The data_state.json files track which shards have been consumed. To resume training from where it left off:

```json
{
  "global_step": 2000,
  "consumed_samples": 32110000,
  "consumed_shards": 3211,
  "rows_per_shard": 10000,
  "seed": 42
}
```

Pass this via `--resume_data_state <path_to_data_state.json>` to avoid re-training on already consumed data.
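Because shards are consumed in a fixed order, the resume logic reduces to starting the shard iterator at index `consumed_shards`. The sketch below assumes a hypothetical shard-naming scheme for illustration; it is not data.py's actual code.

```python
import json

# The data_state.json payload shown above:
state = json.loads("""
{"global_step": 2000, "consumed_samples": 32110000,
 "consumed_shards": 3211, "rows_per_shard": 10000, "seed": 42}
""")

# Hypothetical shard names; the real directory layout may differ.
all_shards = [f"shard-{i:06d}.parquet" for i in range(387_270)]

# Resume = skip everything already consumed, in order.
remaining = all_shards[state["consumed_shards"]:]
print(remaining[0])  # first shard the resumed run will read
```

This is why passing a stale or mismatched data_state.json silently re-trains on already-seen shards: the skip count is the only thing standing between the run and the start of the corpus.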
## Training Code

All training code is in the training/ directory:

| File | Description |
|---|---|
| train.py | Main training script (HuggingFace Trainer + FSDP + streaming dataset) |
| data.py | Dataset loading, tokenization, packing, data state management |
| spm_tokenizer.py | SpmDirectTokenizer, a custom SPM wrapper without whitespace pre-split |
| moe_design_hunmin-fm-13B-A2B.json | Model architecture design spec |
| trainjob-64k-remap-5node.yaml | Kubeflow TrainJob YAML for 64K context training |
| data_state_16k_remap_2000.json | Data state after Phase 3 (consumed_shards=3211) |
## How to Resume Training

- Use the latest checkpoint as `--resume_from_checkpoint`.
- Use the corresponding data_state.json as `--resume_data_state` to skip consumed shards.
- Ensure spm_tokenizer.py is in the same directory as train.py.
- Key arguments:

```bash
torchrun --nnodes=5 --nproc_per_node=8 ... train.py \
  --model_config_json moe_design_hunmin-fm-13B-A2B.json \
  --tokenizer_name_or_path <model_dir_with_tokenizer.model> \
  --train_path <parquet_data_dir> \
  --streaming --packing \
  --resume_from_checkpoint <checkpoint_dir> \
  --resume_data_state <data_state.json> \
  --attn_implementation sdpa \
  --max_length <context_length> \
  --per_device_train_batch_size <bs> \
  --gradient_accumulation_steps 1 \
  --gradient_checkpointing \
  --fsdp \
  --save_only_model
```
## Environment

| Item | Value |
|---|---|
| Framework | PyTorch 2.9.1, Transformers 5.x, Accelerate |
| GPU | NVIDIA B200 (180GB), 8 per node |
| Nodes | 5 |
| Network | InfiniBand (Mellanox mlx5) |
| FSDP | Full shard, bf16 mixed precision |
| NCCL | IB enabled, GDR level 8 |
| Memory | PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,max_split_size_mb:256 |