---
license: apache-2.0
datasets:
  - danish-foundation-models/danish-dynaword
  - common-pile/comma_v0.1_training_dataset
language:
  - da
  - en
base_model:
  - common-pile/comma-v0.1-2t
pipeline_tag: text-generation
---

# Munin-7B-open-pt

Munin-7B-open-pt is a 7-billion-parameter language model continually pre-trained from Comma v0.1-2T on 30B tokens drawn from a mix of Dynaword and the Comma v0.1 dataset, both of which comprise only public-domain and openly licensed data.
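As a quick consistency check, the per-stage batch sizes (tokens per step) and step counts listed in the training table below multiply out to roughly the stated 30B-token budget:

```python
# Back-of-the-envelope check of the 30B-token budget, using the batch sizes
# and step counts from the training-stage table in this card.
stage_tokens = {
    "stage1": 262_144 * 37_852,   # ~9.92B tokens
    "stage2": 524_288 * 18_926,   # ~9.92B tokens
    "stage3": 524_288 * 18_926,   # ~9.92B tokens
}
total_tokens = sum(stage_tokens.values())
print(f"{total_tokens / 1e9:.2f}B tokens")  # ~29.77B, i.e. roughly 30B
```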

Munin-7B-open-pt is a base model intended as a starting point for fine-tuning and post-training. It has not been instruction-tuned and should not be expected to function as a chat model out of the box.
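For reference, a minimal sketch of how a stage checkpoint might be loaded for fine-tuning with Hugging Face `transformers`. The repository id below is an assumption (this card does not state the final Hub path); the `subfolder` values match the stage table below.

```python
# Hedged sketch: building from_pretrained arguments for a given training stage.
# REPO_ID is hypothetical; verify the actual path on the Hub before use.
REPO_ID = "danish-foundation-models/munin-7b-open-pt"  # assumed Hub id

def load_kwargs(stage: int) -> dict:
    """Keyword arguments for AutoModelForCausalLM.from_pretrained."""
    assert stage in (1, 2, 3), "this card documents stages 1-3"
    return {
        "pretrained_model_name_or_path": REPO_ID,
        "subfolder": f"stage{stage}",  # as listed in the table: subfolder="stage1", ...
        "torch_dtype": "auto",
    }

# Downloading the 7B weights is left to the caller, e.g.:
# from transformers import AutoModelForCausalLM
# model = AutoModelForCausalLM.from_pretrained(**load_kwargs(3))
```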

## Training details

Munin-7B-open-pt has been trained using the maester framework developed as part of the Danish Foundation Models project. All training was performed on a single 8x Nvidia B200 node (the first of its kind in Denmark).

The training was performed in three stages, with data mix (open-stageK.py) and maester (open-stageK.toml) configuration files available in each subfolder. The datasets can be created using the create_dataset.py script provided in this repository.

The characteristics of the three pre-training stages are detailed in the following table:

| Stage | Batch size | Steps | HF path | Data mix | Comments |
|---|---|---|---|---|---|
| stage1 | 262,144 tok | 37,852 | `subfolder="stage1"` | 2/3 DynaWord; 1/3 Common-Pile | Excludes depbank, jvj, nordjyllandnews, synne from DynaWord; uses subsets and weighting from the Comma-v0.1-2T cooldown phase for Common-Pile; LR schedule with 1,000 steps warmup, constant 1e-5, 1,000 steps cooldown |
| stage2 | 524,288 tok | 18,926 | `subfolder="stage2"` | 2/3 DynaWord; 1/3 Common-Pile | Excludes depbank, jvj, nordjyllandnews, synne from DynaWord; uses subsets and weighting from the Comma-v0.1-2T cooldown phase for Common-Pile; LR schedule with 500 steps warmup, constant 1e-5, 500 steps cooldown |
| stage3 | 524,288 tok | 18,926 | `subfolder="stage3"` | 2/3 DynaWord; 1/3 Common-Pile | Excludes depbank, jvj, nordjyllandnews, synne from DynaWord; uses subsets and weighting from the Comma-v0.1-2T cooldown phase for Common-Pile; LR schedule with 500 steps warmup, square-root decay from 1e-5 |
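The warmup / constant / cooldown learning-rate schedule described for stages 1 and 2 can be sketched as follows. This is a minimal illustration of the schedule as described in the table, not the actual maester implementation, and it assumes linear warmup and cooldown; stage 3 instead uses square-root decay after warmup.

```python
# Hedged sketch of the stage-1 LR schedule: 1,000 steps linear warmup,
# constant 1e-5, then 1,000 steps linear cooldown over 37,852 total steps.
def lr_at_step(step, total_steps=37_852, warmup=1_000, cooldown=1_000, peak=1e-5):
    if step < warmup:
        return peak * step / warmup                 # linear warmup to peak
    if step >= total_steps - cooldown:
        remaining = total_steps - step
        return peak * remaining / cooldown          # linear cooldown to zero
    return peak                                     # constant middle phase
```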

## Limitations

Munin-7B-open-pt was trained only on Danish- and English-language data, plus code from the 15 programming languages covered by the stack-edu classifiers. It will likely perform poorly on other natural and programming languages.

As a base model, Munin-7B-open-pt has not been aligned for safety: it may reflect social biases present in its training data and may produce toxic or harmful content.