---
license: apache-2.0
datasets:
- danish-foundation-models/danish-dynaword
- common-pile/comma_v0.1_training_dataset
language:
- da
- en
base_model:
- common-pile/comma-v0.1-2t
pipeline_tag: text-generation
---
# Munin-7B-Open-pt
Munin-7B-Open-pt is a 7-billion-parameter language model continually pre-trained from Comma v0.1-2T on 30B tokens drawn from a mix of the Dynaword and Comma v0.1 datasets, both comprising only public-domain and openly licensed data.

Munin-7B-Open-pt is a base model intended as a starting point for fine-tuning and post-training. It has not been instruction-tuned and cannot be expected to function as a chat model out of the box.
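As a sketch of how a base (non-chat) checkpoint like this is typically used for plain text completion with Hugging Face transformers — note the repository id in the usage comment is a placeholder assumption, not confirmed by this card:

```python
def complete(model_id: str, prompt: str, max_new_tokens: int = 64) -> str:
    """Load a causal-LM checkpoint and greedily continue `prompt`.

    transformers is imported lazily so the sketch can be read without it installed.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage (downloads the checkpoint; repo id below is a placeholder):
# print(complete("danish-foundation-models/munin-7b-open-pt", "Danmark er et land i"))
```

Because this is a base model, prompts should be phrased as text to be continued rather than as chat-style instructions.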
## Training details
Munin-7B-open-pt has been trained using the maester framework developed as part of the Danish Foundation Models project. All training was performed on a single 8x Nvidia B200 node (the first of its kind in Denmark).
The training was performed in three stages, with the data-mix (`open-stageK.py`) and maester (`open-stageK.toml`) configuration files for stage `K` available in each subfolder. The datasets can be created using the `create_dataset.py` script provided in this repository.
The characteristics of the three pre-training stages are detailed in the following table:
| Stage | Batch size | Steps | HF path | Data mix | Comments |
|---|---|---|---|---|---|
| stage1 | 262,144 tok | 37,852 | `subfolder="stage1"` | 2/3 DynaWord; 1/3 Common-Pile | Excludes depbank, jvj, nordjyllandnews, and synne from DynaWord; uses the subsets and weighting from the Comma-v0.1-2T cooldown phase for Common-Pile; LR schedule with 1,000 steps warmup, constant 1e-5, 1,000 steps cooldown |
| stage2 | 524,288 tok | 18,926 | `subfolder="stage2"` | 2/3 DynaWord; 1/3 Common-Pile | Excludes depbank, jvj, nordjyllandnews, and synne from DynaWord; uses the subsets and weighting from the Comma-v0.1-2T cooldown phase for Common-Pile; LR schedule with 500 steps warmup, constant 1e-5, 500 steps cooldown |
| stage3 | 524,288 tok | 18,926 | `subfolder="stage3"` | 2/3 DynaWord; 1/3 Common-Pile | Excludes depbank, jvj, nordjyllandnews, and synne from DynaWord; uses the subsets and weighting from the Comma-v0.1-2T cooldown phase for Common-Pile; LR schedule with 500 steps warmup, square-root decay from 1e-5 |
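The per-stage token budgets in the table multiply out to the 30B-token total quoted above; a quick sanity check:

```python
# Tokens per stage = batch size (in tokens) * number of steps, from the table.
stages = {
    "stage1": (262_144, 37_852),
    "stage2": (524_288, 18_926),
    "stage3": (524_288, 18_926),
}
per_stage = {name: batch * steps for name, (batch, steps) in stages.items()}
total = sum(per_stage.values())

for name, tokens in per_stage.items():
    print(f"{name}: {tokens / 1e9:.2f}B tokens")
print(f"total: {total / 1e9:.1f}B tokens")
```

Each stage works out to the same ~9.92B tokens (stages 2 and 3 double the batch size but halve the step count), for about 29.8B tokens in total, i.e. roughly the quoted 30B.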
## Limitations
Munin-7B-Open-pt was trained only on Danish- and English-language data and on code from the 15 programming languages covered by the stack-edu classifiers. It will likely perform poorly on other natural or programming languages.
As a base model, Munin-7B-Open-pt has not been aligned for safety and may, for example, reflect social biases present in its training data or potentially provide toxic or harmful information.