configs for reproducibility
- README.md +13 -2
- stage1/open-stage1.py +82 -0
- stage1/open-stage1.toml +42 -0
- stage2/open-stage2.py +82 -0
- stage2/open-stage2.toml +44 -0
- stage3/open-stage3.py +82 -0
- stage3/open-stage3.toml +44 -0
README.md
CHANGED

@@ -12,13 +12,24 @@ pipeline_tag: text-generation
---
# Munin-7B-Open-pt

Munin-7B-open-pt is a 7-billion-parameter language model continually pre-trained from [Comma v0.1-2T](https://huggingface.co/common-pile/comma-v0.1-2t/) on 30B tokens drawn from a mix of the [Dynaword](https://huggingface.co/datasets/danish-foundation-models/danish-dynaword) and [Comma v0.1](https://huggingface.co/datasets/common-pile/comma_v0.1_training_dataset) datasets, both comprising only public domain and openly licensed data.
Munin-7B-open-pt is a base model that can be used as the starting point for fine-tuning and post-training. It has not been instruction-tuned and should not be expected to function as a chat model.
## Training details

Munin-7B-open-pt has been trained using the [maester](https://github.com/rlrs/maester) framework developed as part of the [Danish Foundation Models project](https://foundationmodels.dk/). All training was performed on a single 8x Nvidia B200 node (the first of its kind in Denmark).
The training was performed in three stages, with the data mix (open-stageK.py) and maester (open-stageK.toml) configuration files for each stage available in the corresponding subfolder. The three pre-training stages are detailed in the following table:
| Stage | Batch size | Steps | HF path | Data mix | Comments |
|-|-|-|-|-|-|
| stage1 | 262,144 tok | 37,852 | [subfolder="stage1"](https://huggingface.co/danish-foundation-models/munin-7b-open-pt/tree/main/stage1) | 2/3 [DynaWord](https://huggingface.co/datasets/danish-foundation-models/danish-dynaword/tree/9e230b35e31a510e5ab909112ad5bfc9463b2c23); <br> 1/3 [Common-Pile](https://huggingface.co/common-pile/comma_v0.1_training_dataset/5afc546db324e7f39f297ba757c9a60547151e7c/) | Excludes depbank, jvj, nordjyllandnews, synne from DynaWord; <br> uses subsets and weighting from the [Comma-v0.1-2T](https://huggingface.co/common-pile/comma-v0.1-2t) cooldown phase for Common-Pile; LR schedule with 1,000 steps warmup, constant 1e-5, 1,000 steps cooldown |
| stage2 | 524,288 tok | 18,926 | [subfolder="stage2"](https://huggingface.co/danish-foundation-models/munin-7b-open-pt/tree/main/stage2) | 2/3 [DynaWord](https://huggingface.co/datasets/danish-foundation-models/danish-dynaword/tree/9e230b35e31a510e5ab909112ad5bfc9463b2c23); <br> 1/3 [Common-Pile](https://huggingface.co/common-pile/comma_v0.1_training_dataset/5afc546db324e7f39f297ba757c9a60547151e7c/) | Excludes depbank, jvj, nordjyllandnews, synne from DynaWord; <br> uses subsets and weighting from the [Comma-v0.1-2T](https://huggingface.co/common-pile/comma-v0.1-2t) cooldown phase for Common-Pile; LR schedule with 500 steps warmup, constant 1e-5, 500 steps cooldown |
| stage3 | 524,288 tok | 18,926 | [subfolder="stage3"](https://huggingface.co/danish-foundation-models/munin-7b-open-pt/tree/main/stage3) | 2/3 [DynaWord](https://huggingface.co/datasets/danish-foundation-models/danish-dynaword/tree/9e230b35e31a510e5ab909112ad5bfc9463b2c23); <br> 1/3 [Common-Pile](https://huggingface.co/common-pile/comma_v0.1_training_dataset/5afc546db324e7f39f297ba757c9a60547151e7c/) | Excludes depbank, jvj, nordjyllandnews, synne from DynaWord; <br> uses subsets and weighting from the [Comma-v0.1-2T](https://huggingface.co/common-pile/comma-v0.1-2t) cooldown phase for Common-Pile; LR schedule with 500 steps warmup, square-root decay from 1e-5 |
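
Each stage checkpoint can be loaded directly from the Hub by pointing `transformers` at the corresponding subfolder. The snippet below is a minimal sketch, assuming each subfolder holds a standard Hugging Face checkpoint; dtype and device placement are left to your setup:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# The tokenizer is inherited from the Comma v0.1-2T base model
# (see tokenizer_name in the stage configs below).
tokenizer = AutoTokenizer.from_pretrained("common-pile/comma-v0.1-2t")

# Load the final (stage3) checkpoint; use subfolder="stage1" or
# subfolder="stage2" for the earlier stages.
model = AutoModelForCausalLM.from_pretrained(
    "danish-foundation-models/munin-7b-open-pt",
    subfolder="stage3",
)
```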
## Limitations

Munin-7B-Open-pt was trained only on Danish- and English-language data, plus code from the 15 programming languages covered by the [stack-edu classifiers](https://huggingface.co/collections/HuggingFaceTB/the-ultimate-collection-of-code-classifiers-67b5aa3eb8994a4b71453005).
It is likely to perform poorly on other natural or programming languages.
As a base model, Munin-7B-Open-pt has not been aligned for safety and may, for example, reflect social biases present in its training data or provide toxic or harmful information.
stage1/open-stage1.py
ADDED
@@ -0,0 +1,82 @@
# Sampling weights for the DynaWord subsets used in training.
dyna_train = {
    "adl": 1.0,
    "ai-aktindsigt": 1.0,
    "botxt": 1.0,
    "cellar": 1.0,
    "dannet": 1.0,
    "danske-taler": 1.0,
    "domsdatabasen": 1.0,
    "enevaeldens_nyheder": 1.0,
    "ep": 1.0,
    "eur-lex-sum-da": 1.0,
    "fm-udgivelser": 1.0,
    "ft": 1.0,
    "grundtvig": 1.0,
    "gutenberg": 1.0,
    "health_hovedstaden": 1.0,
    "hest": 1.0,
    "historical-danish-handwriting": 1.0,
    "memo": 1.0,
    "miljoeportalen": 1.0,
    "naat": 1.0,
    "ncc_books": 1.0,
    "ncc_maalfrid": 1.0,
    "ncc_newspaper": 1.0,
    "ncc_parliament": 1.0,
    "nota": 1.0,
    "opensubtitles": 1.0,
    "relig": 1.0,
    "retsinformationdk": 1.0,
    "skat": 1.0,
    "retspraksis": 1.0,
    "spont": 1.0,
    "tv2r": 1.0,
    "wiki-comments": 1.0,
    "wikibooks": 1.0,
    "wikipedia": 1.0,
    "wikisource": 1.0,
}

# DynaWord subsets excluded from training and held out for testing.
dyna_test = {
    "depbank": 1.0,
    "jvj": 1.0,
    "nordjyllandnews": 1.0,
    "synne": 1.0,
}

# Common-Pile subsets, weighted as in the Comma-v0.1-2T cooldown phase.
cp_train = {
    "arxiv_papers": 0.5,
    "cccc": 0.3,
    "data_provenance_initiative": 2,
    "doab": 2,
    "foodista": 2,
    "libretexts": 2,
    "news": 2,
    "oercommons": 2,
    "peS2o": 0.1,
    "pressbooks": 2,
    "public_domain_review": 2,
    "python_enhancement_proposals": 2,
    "stackexchange": 0.25,
    "stackv2_edu": 0.1,
    "wikimedia": 0.4,
}

# Data sources: Common-Pile is split into 16 shards so that each training
# stage can read a different shard (stage1 uses shard 0).
sources = {
    "dyna": {
        "uri": "hf://datasets/danish-foundation-models/danish-dynaword/data/{key}/*.parquet",
        "format": "parquet",
        "shards": 1,
        "shard_index": 0,
        "train": dyna_train,
        "test": dyna_test,
    },
    "cp": {
        "uri": "hf://datasets/common-pile/comma_v0.1_training_dataset/{key}/*.jsonl.gz",
        "format": "json",
        "shards": 16,
        "shard_index": 0,
        "train": cp_train,
        "test": {},
    },
}
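These dictionaries are consumed by maester's data-preparation tooling, which defines their exact semantics. As a rough illustration only (not maester's actual implementation), the values can be read as relative sampling weights; the sketch below, appended to the config above, draws a subset key with probability proportional to its weight:

import random

def pick_subset(weights: dict[str, float], rng: random.Random) -> str:
    """Pick a subset key with probability proportional to its weight."""
    keys = list(weights)
    return rng.choices(keys, weights=[weights[k] for k in keys], k=1)[0]

rng = random.Random(0)
# Under this reading, a weight-2 subset such as "doab" is drawn 20x as
# often as a weight-0.1 subset such as "peS2o" or "stackv2_edu".
print(pick_subset(cp_train, rng))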
stage1/open-stage1.toml
ADDED
@@ -0,0 +1,42 @@
model_name = "llama3"
flavor = "Comma7B"
tokenizer_name = "common-pile/comma-v0.1-2t"

# job
job_name = "munin-7b-open-stage1"
wandb_project = "munin-7b-open-stage1"
enable_wandb = false

# parallelism
num_nodes = 1
data_parallel_shard_degree = 8
data_parallel_replicate_degree = 1

# training settings
train_batch_size = 8
seq_len = 4096
train_num_steps = 37852
scheduler = "linear_warmup_constant_sqrt_decay"
warmup_steps = 1000
cooldown_steps = 1000
checkpoint_interval = 1000
forced_load_path = "/work/training/maester/comma-v0.1-2t-dcp/"
compile = true
enable_cut_cross_entropy = false
ac_mode = "none"
selective_ac_option = "op"

[dataset]
bos_token = 2
eos_token = 1
data_dirs = [
    "/work/production/data/munin-open-dyna-0-of-1-cp-0-of-16-train/",
]
dataset_weights = "1.0"

[opt_cfg] # must specify *all* fields here, will not merge with defaults
lr = 1e-5
betas = [0.9, 0.95]
weight_decay = 0.1
eps = 1e-9
fused = true
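The batch sizes and token counts in the README table follow directly from these settings. The back-of-the-envelope check below assumes train_batch_size counts sequences per data-parallel rank, which is the reading that makes the numbers line up with the table:

# stage1: 8 seqs/rank/step * 4096 tokens * 8 data-parallel ranks
stage1_tokens_per_step = 8 * 4096 * 8                  # = 262,144 tokens/step
stage1_total = stage1_tokens_per_step * 37852          # ~9.92B tokens

# stage2/3 add gradient_accumulation_steps = 2, doubling the batch
stage23_tokens_per_step = stage1_tokens_per_step * 2   # = 524,288 tokens/step
stage23_total = stage23_tokens_per_step * 18926        # ~9.92B tokens each

total = stage1_total + 2 * stage23_total               # ~29.8B, i.e. the quoted ~30B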
stage2/open-stage2.py
ADDED
@@ -0,0 +1,82 @@
# Identical to stage1/open-stage1.py except "shard_index": 1 for the
# Common-Pile source, so stage2 reads the next sixteenth of the data.
dyna_train = {
    "adl": 1.0,
    "ai-aktindsigt": 1.0,
    "botxt": 1.0,
    "cellar": 1.0,
    "dannet": 1.0,
    "danske-taler": 1.0,
    "domsdatabasen": 1.0,
    "enevaeldens_nyheder": 1.0,
    "ep": 1.0,
    "eur-lex-sum-da": 1.0,
    "fm-udgivelser": 1.0,
    "ft": 1.0,
    "grundtvig": 1.0,
    "gutenberg": 1.0,
    "health_hovedstaden": 1.0,
    "hest": 1.0,
    "historical-danish-handwriting": 1.0,
    "memo": 1.0,
    "miljoeportalen": 1.0,
    "naat": 1.0,
    "ncc_books": 1.0,
    "ncc_maalfrid": 1.0,
    "ncc_newspaper": 1.0,
    "ncc_parliament": 1.0,
    "nota": 1.0,
    "opensubtitles": 1.0,
    "relig": 1.0,
    "retsinformationdk": 1.0,
    "skat": 1.0,
    "retspraksis": 1.0,
    "spont": 1.0,
    "tv2r": 1.0,
    "wiki-comments": 1.0,
    "wikibooks": 1.0,
    "wikipedia": 1.0,
    "wikisource": 1.0,
}

dyna_test = {
    "depbank": 1.0,
    "jvj": 1.0,
    "nordjyllandnews": 1.0,
    "synne": 1.0,
}

cp_train = {
    "arxiv_papers": 0.5,
    "cccc": 0.3,
    "data_provenance_initiative": 2,
    "doab": 2,
    "foodista": 2,
    "libretexts": 2,
    "news": 2,
    "oercommons": 2,
    "peS2o": 0.1,
    "pressbooks": 2,
    "public_domain_review": 2,
    "python_enhancement_proposals": 2,
    "stackexchange": 0.25,
    "stackv2_edu": 0.1,
    "wikimedia": 0.4,
}

sources = {
    "dyna": {
        "uri": "hf://datasets/danish-foundation-models/danish-dynaword/data/{key}/*.parquet",
        "format": "parquet",
        "shards": 1,
        "shard_index": 0,
        "train": dyna_train,
        "test": dyna_test,
    },
    "cp": {
        "uri": "hf://datasets/common-pile/comma_v0.1_training_dataset/{key}/*.jsonl.gz",
        "format": "json",
        "shards": 16,
        "shard_index": 1,
        "train": cp_train,
        "test": {},
    },
}
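Because each stage reads a different sixteenth of the Common-Pile files, no Common-Pile document is repeated across stages. One plausible way such sharding could work is a stride over the sorted file list; this is a hypothetical sketch with illustrative file names, and maester's actual logic may differ:

def shard_files(files: list[str], shards: int, shard_index: int) -> list[str]:
    """Assign every `shards`-th file (offset by `shard_index`) to this shard."""
    return sorted(files)[shard_index::shards]

# Illustrative: 64 hypothetical Common-Pile archives split 16 ways.
files = [f"arxiv_papers/part-{i:04d}.jsonl.gz" for i in range(64)]
stage2_files = shard_files(files, shards=16, shard_index=1)
assert len(stage2_files) == 4  # each shard gets 1/16 of the files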
stage2/open-stage2.toml
ADDED
@@ -0,0 +1,44 @@
model_name = "llama3"
flavor = "Comma7B"
tokenizer_name = "common-pile/comma-v0.1-2t"

# job
job_name = "munin-7b-open-stage2"
wandb_project = "munin-7b-open-stage2"
enable_wandb = false

# parallelism
num_nodes = 1
data_parallel_shard_degree = 8
data_parallel_replicate_degree = 1

# training settings
train_batch_size = 8
gradient_accumulation_steps = 2
gradient_accumulation_sync_each_step = true
seq_len = 4096
train_num_steps = 18926 # 37852 // 2
scheduler = "linear_warmup_constant_sqrt_decay"
warmup_steps = 500
cooldown_steps = 500
checkpoint_interval = 1000
forced_load_path = "/work/training/maester/jobs/munin-7b-open-stage1/checkpoints/step-37852/"
compile = true
enable_cut_cross_entropy = false
ac_mode = "none"
selective_ac_option = "op"

[dataset]
bos_token = 2
eos_token = 1
data_dirs = [
    "/work/production/data/dsk-open-dyna-0-of-1-cp-1-of-16-train/",
]
dataset_weights = "1.0"

[opt_cfg] # must specify *all* fields here, will not merge with defaults
lr = 1e-5
betas = [0.9, 0.95]
weight_decay = 0.1
eps = 1e-9
fused = true
stage3/open-stage3.py
ADDED
@@ -0,0 +1,82 @@
# Identical to stage1/open-stage1.py except "shard_index": 2 for the
# Common-Pile source, so stage3 reads a third, fresh sixteenth of the data.
dyna_train = {
    "adl": 1.0,
    "ai-aktindsigt": 1.0,
    "botxt": 1.0,
    "cellar": 1.0,
    "dannet": 1.0,
    "danske-taler": 1.0,
    "domsdatabasen": 1.0,
    "enevaeldens_nyheder": 1.0,
    "ep": 1.0,
    "eur-lex-sum-da": 1.0,
    "fm-udgivelser": 1.0,
    "ft": 1.0,
    "grundtvig": 1.0,
    "gutenberg": 1.0,
    "health_hovedstaden": 1.0,
    "hest": 1.0,
    "historical-danish-handwriting": 1.0,
    "memo": 1.0,
    "miljoeportalen": 1.0,
    "naat": 1.0,
    "ncc_books": 1.0,
    "ncc_maalfrid": 1.0,
    "ncc_newspaper": 1.0,
    "ncc_parliament": 1.0,
    "nota": 1.0,
    "opensubtitles": 1.0,
    "relig": 1.0,
    "retsinformationdk": 1.0,
    "skat": 1.0,
    "retspraksis": 1.0,
    "spont": 1.0,
    "tv2r": 1.0,
    "wiki-comments": 1.0,
    "wikibooks": 1.0,
    "wikipedia": 1.0,
    "wikisource": 1.0,
}

dyna_test = {
    "depbank": 1.0,
    "jvj": 1.0,
    "nordjyllandnews": 1.0,
    "synne": 1.0,
}

cp_train = {
    "arxiv_papers": 0.5,
    "cccc": 0.3,
    "data_provenance_initiative": 2,
    "doab": 2,
    "foodista": 2,
    "libretexts": 2,
    "news": 2,
    "oercommons": 2,
    "peS2o": 0.1,
    "pressbooks": 2,
    "public_domain_review": 2,
    "python_enhancement_proposals": 2,
    "stackexchange": 0.25,
    "stackv2_edu": 0.1,
    "wikimedia": 0.4,
}

sources = {
    "dyna": {
        "uri": "hf://datasets/danish-foundation-models/danish-dynaword/data/{key}/*.parquet",
        "format": "parquet",
        "shards": 1,
        "shard_index": 0,
        "train": dyna_train,
        "test": dyna_test,
    },
    "cp": {
        "uri": "hf://datasets/common-pile/comma_v0.1_training_dataset/{key}/*.jsonl.gz",
        "format": "json",
        "shards": 16,
        "shard_index": 2,
        "train": cp_train,
        "test": {},
    },
}
stage3/open-stage3.toml
ADDED
@@ -0,0 +1,44 @@
model_name = "llama3"
flavor = "Comma7B"
tokenizer_name = "common-pile/comma-v0.1-2t"

# job
job_name = "munin-7b-open-stage3"
wandb_project = "munin-7b-open-stage3"
enable_wandb = false

# parallelism
num_nodes = 1
data_parallel_shard_degree = 8
data_parallel_replicate_degree = 1

# training settings
train_batch_size = 8
gradient_accumulation_steps = 2
gradient_accumulation_sync_each_step = true
seq_len = 4096
train_num_steps = 18926 # 37852 // 2
scheduler = "linear_warmup_constant_sqrt_decay"
warmup_steps = 500
cooldown_steps = 18426 # sqrt decay spans the whole run after warmup (18926 - 500)
checkpoint_interval = 1000
forced_load_path = "/work/training/maester/jobs/munin-7b-open-stage2/checkpoints/step-18926/"
compile = true
enable_cut_cross_entropy = false
ac_mode = "none"
selective_ac_option = "op"

[dataset]
bos_token = 2
eos_token = 1
data_dirs = [
    "/work/production/data/dsk-open-dyna-0-of-1-cp-2-of-16-train/",
]
dataset_weights = "1.0"

[opt_cfg] # must specify *all* fields here, will not merge with defaults
lr = 1e-5
betas = [0.9, 0.95]
weight_decay = 0.1
eps = 1e-9
fused = true
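
Here cooldown_steps = 18426 = 18926 - 500, so the square-root decay covers the entire run after warmup, matching the stage3 row of the README table (in stages 1 and 2 the cooldown window is short, so the schedule is mostly constant). The sketch below is one plausible reading of the linear_warmup_constant_sqrt_decay shape; maester's actual implementation may differ in details such as off-by-one handling or a nonzero decay floor:

def lr_at(step: int, base_lr: float = 1e-5, warmup: int = 500,
          total: int = 18926, cooldown: int = 18426) -> float:
    """Linear warmup -> constant plateau -> square-root-shaped decay to zero."""
    if step < warmup:                        # linear warmup from 0 to base_lr
        return base_lr * (step + 1) / warmup
    plateau_end = total - cooldown           # 500 for stage3: plateau is empty
    if step < plateau_end:                   # constant phase
        return base_lr
    frac = (step - plateau_end) / cooldown   # fraction of cooldown elapsed
    return base_lr * (1.0 - frac ** 0.5)     # sqrt-shaped decay from base_lr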
|