Commit e0aea51 by peter-sk (parent: 0d1a559)

configs for reproducibility
README.md CHANGED
@@ -12,13 +12,24 @@ pipeline_tag: text-generation
---
# Munin-7B-Open-pt

Munin-7B-Open-pt is a 7 billion parameter language model continually pre-trained from [Comma v0.1-2T](https://huggingface.co/common-pile/comma-v0.1-2t/) on 30B tokens drawn from a mix of [DynaWord](https://huggingface.co/datasets/danish-foundation-models/danish-dynaword) and [the Comma v0.1 dataset](https://huggingface.co/datasets/common-pile/comma_v0.1_training_dataset), both comprising only public domain and openly licensed data.

Munin-7B-Open-pt is a base model that can be used as the starting point for fine-tuning and post-training. It has not been instruction-tuned and cannot be expected to function directly as a chat model.

## Training details

Munin-7B-Open-pt has been trained using the [maester](https://github.com/rlrs/maester) framework developed as part of the [Danish Foundation Models project](https://foundationmodels.dk/). All training was performed on a single 8x Nvidia B200 node (the first of its kind in Denmark).

The training was performed in three stages, with the data-mix (open-stageK.py) and maester (open-stageK.toml) configuration files available in each stage's subfolder. The three pre-training stages are detailed in the following table:

| Stage | Batch size | Steps | HF path | Data mix | Comments |
|-|-|-|-|-|-|
| stage1 | 262,144 tok | 37,852 | [subfolder="stage1"](https://huggingface.co/danish-foundation-models/munin-7b-open-pt/tree/main/stage1) | 2/3 [DynaWord](https://huggingface.co/datasets/danish-foundation-models/danish-dynaword/tree/9e230b35e31a510e5ab909112ad5bfc9463b2c23); <br> 1/3 [Common-Pile](https://huggingface.co/common-pile/comma_v0.1_training_dataset/5afc546db324e7f39f297ba757c9a60547151e7c/) | Excludes depbank, jvj, nordjyllandnews, synne from DynaWord; <br> uses subsets and weighting from the [Comma-v0.1-2T](https://huggingface.co/common-pile/comma-v0.1-2t) cooldown phase for Common-Pile; LR schedule with 1000 steps warmup, constant 1e-5, 1000 steps cooldown |
| stage2 | 524,288 tok | 18,926 | [subfolder="stage2"](https://huggingface.co/danish-foundation-models/munin-7b-open-pt/tree/main/stage2) | 2/3 [DynaWord](https://huggingface.co/datasets/danish-foundation-models/danish-dynaword/tree/9e230b35e31a510e5ab909112ad5bfc9463b2c23); <br> 1/3 [Common-Pile](https://huggingface.co/common-pile/comma_v0.1_training_dataset/5afc546db324e7f39f297ba757c9a60547151e7c/) | Excludes depbank, jvj, nordjyllandnews, synne from DynaWord; <br> uses subsets and weighting from the [Comma-v0.1-2T](https://huggingface.co/common-pile/comma-v0.1-2t) cooldown phase for Common-Pile; LR schedule with 500 steps warmup, constant 1e-5, 500 steps cooldown |
| stage3 | 524,288 tok | 18,926 | [subfolder="stage3"](https://huggingface.co/danish-foundation-models/munin-7b-open-pt/tree/main/stage3) | 2/3 [DynaWord](https://huggingface.co/datasets/danish-foundation-models/danish-dynaword/tree/9e230b35e31a510e5ab909112ad5bfc9463b2c23); <br> 1/3 [Common-Pile](https://huggingface.co/common-pile/comma_v0.1_training_dataset/5afc546db324e7f39f297ba757c9a60547151e7c/) | Excludes depbank, jvj, nordjyllandnews, synne from DynaWord; <br> uses subsets and weighting from the [Comma-v0.1-2T](https://huggingface.co/common-pile/comma-v0.1-2t) cooldown phase for Common-Pile; LR schedule with 500 steps warmup, square root decay from 1e-5 |

## Limitations

Munin-7B-Open-pt was trained only on Danish- and English-language data and on code from the 15 programming languages covered by the [stack-edu classifiers](https://huggingface.co/collections/HuggingFaceTB/the-ultimate-collection-of-code-classifiers-67b5aa3eb8994a4b71453005).
It will likely perform poorly on other natural languages and programming languages.

As a base model, Munin-7B-Open-pt has not been aligned for safety and may, for example, reflect social biases present in its training data or provide toxic or harmful information.
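## Example usage

A minimal loading sketch, assuming the per-stage subfolders hold checkpoints in standard Hugging Face format and that the Comma v0.1 tokenizer named in the training configs is the one to pair with them (illustrative only, not an official snippet from this repository):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumptions: stage checkpoints load via the standard `subfolder` argument,
# and the tokenizer comes from the Comma v0.1-2T repo named in the configs.
tokenizer = AutoTokenizer.from_pretrained("common-pile/comma-v0.1-2t")
model = AutoModelForCausalLM.from_pretrained(
    "danish-foundation-models/munin-7b-open-pt", subfolder="stage1"
)

inputs = tokenizer("Danmark er", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

Dropping the `subfolder` argument should load whatever checkpoint is stored at the repository root.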
stage1/open-stage1.py ADDED
@@ -0,0 +1,82 @@
dyna_train = {
    "adl": 1.0,
    "ai-aktindsigt": 1.0,
    "botxt": 1.0,
    "cellar": 1.0,
    "dannet": 1.0,
    "danske-taler": 1.0,
    "domsdatabasen": 1.0,
    "enevaeldens_nyheder": 1.0,
    "ep": 1.0,
    "eur-lex-sum-da": 1.0,
    "fm-udgivelser": 1.0,
    "ft": 1.0,
    "grundtvig": 1.0,
    "gutenberg": 1.0,
    "health_hovedstaden": 1.0,
    "hest": 1.0,
    "historical-danish-handwriting": 1.0,
    "memo": 1.0,
    "miljoeportalen": 1.0,
    "naat": 1.0,
    "ncc_books": 1.0,
    "ncc_maalfrid": 1.0,
    "ncc_newspaper": 1.0,
    "ncc_parliament": 1.0,
    "nota": 1.0,
    "opensubtitles": 1.0,
    "relig": 1.0,
    "retsinformationdk": 1.0,
    "skat": 1.0,
    "retspraksis": 1.0,
    "spont": 1.0,
    "tv2r": 1.0,
    "wiki-comments": 1.0,
    "wikibooks": 1.0,
    "wikipedia": 1.0,
    "wikisource": 1.0,
}

dyna_test = {
    "depbank": 1.0,
    "jvj": 1.0,
    "nordjyllandnews": 1.0,
    "synne": 1.0,
}

cp_train = {
    "arxiv_papers": 0.5,
    "cccc": 0.3,
    "data_provenance_initiative": 2,
    "doab": 2,
    "foodista": 2,
    "libretexts": 2,
    "news": 2,
    "oercommons": 2,
    "peS2o": 0.1,
    "pressbooks": 2,
    "public_domain_review": 2,
    "python_enhancement_proposals": 2,
    "stackexchange": 0.25,
    "stackv2_edu": 0.1,
    "wikimedia": 0.4,
}

sources = {
    "dyna": {
        "uri": "hf://datasets/danish-foundation-models/danish-dynaword/data/{key}/*.parquet",
        "format": "parquet",
        "shards": 1,
        "shard_index": 0,
        "train": dyna_train,
        "test": dyna_test,
    },
    "cp": {
        "uri": "hf://datasets/common-pile/comma_v0.1_training_dataset/{key}/*.jsonl.gz",
        "format": "json",
        "shards": 16,
        "shard_index": 0,
        "train": cp_train,
        "test": {},
    },
}
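The open-stageK.py files are plain Python data-mix definitions: per-subset sampling weights for DynaWord and Common-Pile, plus `sources` entries whose `uri` templates are expanded per subset key (the Common-Pile side is split into 16 shards; stages 1-3 use shard 0, 1, and 2 respectively). The maester tooling that consumes these files is not part of this commit, so the following is only a rough sketch of the template expansion, not the actual pipeline:

```python
# Illustration only: expand the {key} placeholders in `sources` into concrete
# Hugging Face globs paired with their sampling weights. Assumes the config
# above is on disk as stage1/open-stage1.py.
import runpy

cfg = runpy.run_path("stage1/open-stage1.py")
for name, src in cfg["sources"].items():
    for key, weight in src["train"].items():
        print(f"{name}\tweight={weight}\t{src['uri'].format(key=key)}")
# e.g. dyna  weight=1.0  hf://datasets/danish-foundation-models/danish-dynaword/data/adl/*.parquet
```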
stage1/open-stage1.toml ADDED
@@ -0,0 +1,42 @@
model_name = "llama3"
flavor = "Comma7B"
tokenizer_name = "common-pile/comma-v0.1-2t"

# job
job_name = "munin-7b-open-stage1"
wandb_project = "munin-7b-open-stage1"
enable_wandb = false

# parallelism
num_nodes = 1
data_parallel_shard_degree = 8
data_parallel_replicate_degree = 1

# training settings
train_batch_size = 8
seq_len = 4096
train_num_steps = 37852
scheduler = "linear_warmup_constant_sqrt_decay"
warmup_steps = 1000
cooldown_steps = 1000
checkpoint_interval = 1000
forced_load_path = "/work/training/maester/comma-v0.1-2t-dcp/"
compile = true
enable_cut_cross_entropy = false
ac_mode = "none"
selective_ac_option = "op"

[dataset]
bos_token = 2
eos_token = 1
data_dirs = [
    "/work/production/data/munin-open-dyna-0-of-1-cp-0-of-16-train/",
]
dataset_weights = "1.0"

[opt_cfg] # must specify *all* fields here, will not merge with defaults
lr = 1e-5
betas = [0.9, 0.95]
weight_decay = 0.1
eps = 1e-9
fused = true
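For a cross-check against the batch sizes in the README table, here is a small sketch of the per-step token count. It assumes (my reading, not stated in the config) that `train_batch_size` counts sequences per data-parallel shard; that reading reproduces the reported numbers exactly:

```python
# Stage 1 values from open-stage1.toml.
seq_len = 4096
train_batch_size = 8          # sequences per data-parallel shard (assumption)
dp_shards = 8                 # data_parallel_shard_degree
grad_accum = 1                # stages 2-3 set gradient_accumulation_steps = 2

tokens_per_step = seq_len * train_batch_size * dp_shards * grad_accum
print(tokens_per_step)        # 262144 -> the "262,144 tok" batch size for stage1
print(tokens_per_step * 2)    # 524288 -> matches stages 2 and 3
```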
stage2/open-stage2.py ADDED
@@ -0,0 +1,82 @@
dyna_train = {
    "adl": 1.0,
    "ai-aktindsigt": 1.0,
    "botxt": 1.0,
    "cellar": 1.0,
    "dannet": 1.0,
    "danske-taler": 1.0,
    "domsdatabasen": 1.0,
    "enevaeldens_nyheder": 1.0,
    "ep": 1.0,
    "eur-lex-sum-da": 1.0,
    "fm-udgivelser": 1.0,
    "ft": 1.0,
    "grundtvig": 1.0,
    "gutenberg": 1.0,
    "health_hovedstaden": 1.0,
    "hest": 1.0,
    "historical-danish-handwriting": 1.0,
    "memo": 1.0,
    "miljoeportalen": 1.0,
    "naat": 1.0,
    "ncc_books": 1.0,
    "ncc_maalfrid": 1.0,
    "ncc_newspaper": 1.0,
    "ncc_parliament": 1.0,
    "nota": 1.0,
    "opensubtitles": 1.0,
    "relig": 1.0,
    "retsinformationdk": 1.0,
    "skat": 1.0,
    "retspraksis": 1.0,
    "spont": 1.0,
    "tv2r": 1.0,
    "wiki-comments": 1.0,
    "wikibooks": 1.0,
    "wikipedia": 1.0,
    "wikisource": 1.0,
}

dyna_test = {
    "depbank": 1.0,
    "jvj": 1.0,
    "nordjyllandnews": 1.0,
    "synne": 1.0,
}

cp_train = {
    "arxiv_papers": 0.5,
    "cccc": 0.3,
    "data_provenance_initiative": 2,
    "doab": 2,
    "foodista": 2,
    "libretexts": 2,
    "news": 2,
    "oercommons": 2,
    "peS2o": 0.1,
    "pressbooks": 2,
    "public_domain_review": 2,
    "python_enhancement_proposals": 2,
    "stackexchange": 0.25,
    "stackv2_edu": 0.1,
    "wikimedia": 0.4,
}

sources = {
    "dyna": {
        "uri": "hf://datasets/danish-foundation-models/danish-dynaword/data/{key}/*.parquet",
        "format": "parquet",
        "shards": 1,
        "shard_index": 0,
        "train": dyna_train,
        "test": dyna_test,
    },
    "cp": {
        "uri": "hf://datasets/common-pile/comma_v0.1_training_dataset/{key}/*.jsonl.gz",
        "format": "json",
        "shards": 16,
        "shard_index": 1,
        "train": cp_train,
        "test": {},
    },
}
stage2/open-stage2.toml ADDED
@@ -0,0 +1,44 @@
model_name = "llama3"
flavor = "Comma7B"
tokenizer_name = "common-pile/comma-v0.1-2t"

# job
job_name = "munin-7b-open-stage2"
wandb_project = "munin-7b-open-stage2"
enable_wandb = false

# parallelism
num_nodes = 1
data_parallel_shard_degree = 8
data_parallel_replicate_degree = 1

# training settings
train_batch_size = 8
gradient_accumulation_steps = 2
gradient_accumulation_sync_each_step = true
seq_len = 4096
train_num_steps = 18926 # 37852 // 2
scheduler = "linear_warmup_constant_sqrt_decay"
warmup_steps = 500
cooldown_steps = 500
checkpoint_interval = 1000
forced_load_path = "/work/training/maester/jobs/munin-7b-open-stage1/checkpoints/step-37852/"
compile = true
enable_cut_cross_entropy = false
ac_mode = "none"
selective_ac_option = "op"

[dataset]
bos_token = 2
eos_token = 1
data_dirs = [
    "/work/production/data/dsk-open-dyna-0-of-1-cp-1-of-16-train/",
]
dataset_weights = "1.0"

[opt_cfg] # must specify *all* fields here, will not merge with defaults
lr = 1e-5
betas = [0.9, 0.95]
weight_decay = 0.1
eps = 1e-9
fused = true
stage3/open-stage3.py ADDED
@@ -0,0 +1,82 @@
dyna_train = {
    "adl": 1.0,
    "ai-aktindsigt": 1.0,
    "botxt": 1.0,
    "cellar": 1.0,
    "dannet": 1.0,
    "danske-taler": 1.0,
    "domsdatabasen": 1.0,
    "enevaeldens_nyheder": 1.0,
    "ep": 1.0,
    "eur-lex-sum-da": 1.0,
    "fm-udgivelser": 1.0,
    "ft": 1.0,
    "grundtvig": 1.0,
    "gutenberg": 1.0,
    "health_hovedstaden": 1.0,
    "hest": 1.0,
    "historical-danish-handwriting": 1.0,
    "memo": 1.0,
    "miljoeportalen": 1.0,
    "naat": 1.0,
    "ncc_books": 1.0,
    "ncc_maalfrid": 1.0,
    "ncc_newspaper": 1.0,
    "ncc_parliament": 1.0,
    "nota": 1.0,
    "opensubtitles": 1.0,
    "relig": 1.0,
    "retsinformationdk": 1.0,
    "skat": 1.0,
    "retspraksis": 1.0,
    "spont": 1.0,
    "tv2r": 1.0,
    "wiki-comments": 1.0,
    "wikibooks": 1.0,
    "wikipedia": 1.0,
    "wikisource": 1.0,
}

dyna_test = {
    "depbank": 1.0,
    "jvj": 1.0,
    "nordjyllandnews": 1.0,
    "synne": 1.0,
}

cp_train = {
    "arxiv_papers": 0.5,
    "cccc": 0.3,
    "data_provenance_initiative": 2,
    "doab": 2,
    "foodista": 2,
    "libretexts": 2,
    "news": 2,
    "oercommons": 2,
    "peS2o": 0.1,
    "pressbooks": 2,
    "public_domain_review": 2,
    "python_enhancement_proposals": 2,
    "stackexchange": 0.25,
    "stackv2_edu": 0.1,
    "wikimedia": 0.4,
}

sources = {
    "dyna": {
        "uri": "hf://datasets/danish-foundation-models/danish-dynaword/data/{key}/*.parquet",
        "format": "parquet",
        "shards": 1,
        "shard_index": 0,
        "train": dyna_train,
        "test": dyna_test,
    },
    "cp": {
        "uri": "hf://datasets/common-pile/comma_v0.1_training_dataset/{key}/*.jsonl.gz",
        "format": "json",
        "shards": 16,
        "shard_index": 2,
        "train": cp_train,
        "test": {},
    },
}
stage3/open-stage3.toml ADDED
@@ -0,0 +1,44 @@
model_name = "llama3"
flavor = "Comma7B"
tokenizer_name = "common-pile/comma-v0.1-2t"

# job
job_name = "munin-7b-open-stage3"
wandb_project = "munin-7b-open-stage3"
enable_wandb = false

# parallelism
num_nodes = 1
data_parallel_shard_degree = 8
data_parallel_replicate_degree = 1

# training settings
train_batch_size = 8
gradient_accumulation_steps = 2
gradient_accumulation_sync_each_step = true
seq_len = 4096
train_num_steps = 18926 # 37852 // 2
scheduler = "linear_warmup_constant_sqrt_decay"
warmup_steps = 500
cooldown_steps = 18426
checkpoint_interval = 1000
forced_load_path = "/work/training/maester/jobs/munin-7b-open-stage2/checkpoints/step-18926/"
compile = true
enable_cut_cross_entropy = false
ac_mode = "none"
selective_ac_option = "op"

[dataset]
bos_token = 2
eos_token = 1
data_dirs = [
    "/work/production/data/dsk-open-dyna-0-of-1-cp-2-of-16-train/",
]
dataset_weights = "1.0"

[opt_cfg] # must specify *all* fields here, will not merge with defaults
lr = 1e-5
betas = [0.9, 0.95]
weight_decay = 0.1
eps = 1e-9
fused = true
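Finally, a quick arithmetic check (mine, not from the model card) that the three stages together account for the roughly 30B tokens stated in the README:

```python
# Tokens per stage = steps * global batch size in tokens (from the README table).
stage1 = 37_852 * 262_144
stage2 = 18_926 * 524_288
stage3 = 18_926 * 524_288

print(stage1, stage2, stage3)    # 9,922,674,688 tokens each (~9.9B)
print(stage1 + stage2 + stage3)  # 29,768,024,064 tokens, i.e. roughly 30B
```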