---
license: apache-2.0
datasets:
- danish-foundation-models/danish-dynaword
- common-pile/comma_v0.1_training_dataset
language:
- da
- en
base_model:
- common-pile/comma-v0.1-2t
pipeline_tag: text-generation
---
# Munin-7B-Open-pt
Munin-7B-Open-pt is a 7-billion-parameter language model continually pre-trained from Comma v0.1-2T on 30B tokens drawn from a mix of the Dynaword and Comma v0.1 datasets, both comprising only public-domain and openly licensed data.

Munin-7B-Open-pt is a base model intended as a starting point for fine-tuning and post-training. It has not been instruction-tuned and cannot be expected to function as a chat model out of the box.
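As a sketch of how a base (non-chat) checkpoint like this is typically used for plain text completion with Hugging Face transformers — note the repository id in the usage comment is a placeholder assumption, not confirmed by this card:

```python
def complete(model_id: str, prompt: str, max_new_tokens: int = 64) -> str:
    """Load a causal-LM checkpoint and greedily continue `prompt`.

    transformers is imported lazily so the sketch can be read without it installed.
    """
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage (downloads the checkpoint; repo id below is a placeholder):
# print(complete("danish-foundation-models/munin-7b-open-pt", "Danmark er et land i"))
```

Because this is a base model, prompts should be phrased as text to be continued rather than as chat-style instructions.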
## Training details
Munin-7B-open-pt has been trained using the maester framework developed as part of the Danish Foundation Models project. All training was performed on a single 8x Nvidia B200 node (the first of its kind in Denmark).
The training was performed in three stages, with the data-mix (`open-stageK.py`) and maester (`open-stageK.toml`) configuration files for stage `K` available in each subfolder. The datasets can be created using the `create_dataset.py` script provided in this repository.
The characteristics of the three pre-training stages are detailed in the following table:
| Stage | Batch size | Steps | HF path | Data mix | Comments |
|---|---|---|---|---|---|
| stage1 | 262,144 tok | 37,852 | `subfolder="stage1"` | 2/3 DynaWord; 1/3 Common-Pile | Excludes depbank, jvj, nordjyllandnews, and synne from DynaWord; uses the subsets and weighting from the Comma-v0.1-2T cooldown phase for Common-Pile; LR schedule with 1,000 steps warmup, constant 1e-5, 1,000 steps cooldown |
| stage2 | 524,288 tok | 18,926 | `subfolder="stage2"` | 2/3 DynaWord; 1/3 Common-Pile | Excludes depbank, jvj, nordjyllandnews, and synne from DynaWord; uses the subsets and weighting from the Comma-v0.1-2T cooldown phase for Common-Pile; LR schedule with 500 steps warmup, constant 1e-5, 500 steps cooldown |
| stage3 | 524,288 tok | 18,926 | `subfolder="stage3"` | 2/3 DynaWord; 1/3 Common-Pile | Excludes depbank, jvj, nordjyllandnews, and synne from DynaWord; uses the subsets and weighting from the Comma-v0.1-2T cooldown phase for Common-Pile; LR schedule with 500 steps warmup, square-root decay from 1e-5 |
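The per-stage token budgets in the table multiply out to the 30B-token total quoted above; a quick sanity check:

```python
# Tokens per stage = batch size (in tokens) * number of steps, from the table.
stages = {
    "stage1": (262_144, 37_852),
    "stage2": (524_288, 18_926),
    "stage3": (524_288, 18_926),
}
per_stage = {name: batch * steps for name, (batch, steps) in stages.items()}
total = sum(per_stage.values())

for name, tokens in per_stage.items():
    print(f"{name}: {tokens / 1e9:.2f}B tokens")
print(f"total: {total / 1e9:.1f}B tokens")
```

Each stage works out to the same ~9.92B tokens (stages 2 and 3 double the batch size but halve the step count), for about 29.8B tokens in total, i.e. roughly the quoted 30B.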
## Limitations
Munin-7B-Open-pt was trained only on Danish- and English-language data and on code from the 15 programming languages covered by the stack-edu classifiers. It will likely perform poorly on other natural or programming languages.
As a base model, Munin-7B-Open-pt has not been aligned for safety and may, for example, reflect social biases present in its training data or potentially provide toxic or harmful information.