90 model merge started
I'm merging 90 Nemo models via the karcher method. It uses an enormous amount of pagefile-backed RAM and takes about 48 hours. No idea if it will work properly or be tainted by broken tokenizers. I'll post another update tomorrow when it's finished.
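For anyone following along, a karcher merge in mergekit is configured roughly like the sketch below. The model names are placeholders (not the actual 90 used here), and I'm only showing the fields I know mergekit accepts; anything beyond `merge_method`, `models`, and `dtype` varies by version:

```yaml
# Hypothetical mergekit config sketch for a karcher merge.
# Model names are placeholders, not the actual models in this run.
merge_method: karcher
models:
  - model: mistralai/Mistral-Nemo-Instruct-2407
  - model: some-org/another-nemo-finetune   # ...repeat for each model
dtype: float32   # merge math at fp32, per the resource notes below
```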
With a 4 TB SSD I could merge up to maybe ~120 12B models at once; the 90-model run already needs 2 TB of storage plus 1 TB of pagefile.
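The 2 TB figure checks out from first principles. A quick back-of-the-envelope, assuming ~12.2B params per Nemo model stored at 2 bytes each (fp16/bf16):

```python
# Rough storage estimate for N 12B models on disk.
params = 12.2e9          # approximate Nemo 12B parameter count
bytes_per_param = 2      # fp16/bf16 storage

per_model_gb = params * bytes_per_param / 1e9
total_tb = 90 * per_model_gb / 1e3

print(round(per_model_gb))      # ~24 GB per model
print(round(total_tb, 2))       # ~2.2 TB for 90 models
```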
investigating the error
Error quantizing: main: build = 0 (unknown)
main: built with MSVC 19.29.30159.0 for x64
main: quantizing 'outputs\tmpxqcrs556\DeepWater-Pleroma-12B-v1.fp16.gguf' to 'outputs\tmpxqcrs556\deepwater-pleroma-12b-v1-Q6_K.gguf' as Q6_K
llama_model_loader: loaded meta data with 35 key-value pairs and 363 tensors from outputs\tmpxqcrs556\DeepWater-Pleroma-12B-v1.fp16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = EldritchLabs__DeepWater Pleroma 12B v1
llama_model_loader: - kv 3: general.version str = v1
llama_model_loader: - kv 4: general.basename str = EldritchLabs__DeepWater-Pleroma
llama_model_loader: - kv 5: general.size_label str = 12B
llama_model_loader: - kv 6: general.base_model.count u32 = 0
llama_model_loader: - kv 7: general.tags arr[str,2] = ["mergekit", "merge"]
llama_model_loader: - kv 8: llama.block_count u32 = 40
llama_model_loader: - kv 9: llama.context_length u32 = 131072
llama_model_loader: - kv 10: llama.embedding_length u32 = 5120
llama_model_loader: - kv 11: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 12: llama.attention.head_count u32 = 32
llama_model_loader: - kv 13: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 14: llama.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 15: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: llama.attention.key_length u32 = 128
llama_model_loader: - kv 17: llama.attention.value_length u32 = 128
llama_model_loader: - kv 18: general.file_type u32 = 1
llama_model_loader: - kv 19: llama.vocab_size u32 = 131072
llama_model_loader: - kv 20: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 21: general.quantization_version u32 = 2
llama_model_loader: - kv 22: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 23: tokenizer.ggml.pre str = tekken
llama_model_loader: - kv 24: tokenizer.ggml.tokens arr[str,131072] = ["<unk>", "<s>", "<|im_end|>", "<|im_...
llama_model_loader: - kv 25: tokenizer.ggml.token_type arr[i32,131072] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 26: tokenizer.ggml.merges arr[str,269443] = ["Ġ Ġ", "Ġ t", "e r", "i n", "Ġ Ġ...
llama_model_loader: - kv 27: tokenizer.ggml.bos_token_id u32 = 1
llama_model_loader: - kv 28: tokenizer.ggml.eos_token_id u32 = 2
llama_model_loader: - kv 29: tokenizer.ggml.unknown_token_id u32 = 0
llama_model_loader: - kv 30: tokenizer.ggml.padding_token_id u32 = 10
llama_model_loader: - kv 31: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 32: tokenizer.ggml.add_sep_token bool = false
llama_model_loader: - kv 33: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 34: tokenizer.ggml.add_space_prefix bool = false
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type f16: 282 tensors
ggml_validate_row_data: found inf value at block 367001600
llama_model_quantize: failed to quantize: tensor 'output.weight' has invalid data
main: failed to quantize model from 'outputs\tmpxqcrs556\DeepWater-Pleroma-12B-v1.fp16.gguf'
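The root cause is in the last few lines: ggml's row validation found an inf in `output.weight`, so quantization aborted. The same check is easy to reproduce with numpy; this is a minimal sketch on a plain array (loading the actual tensor out of the GGUF, e.g. with the `gguf` Python package, is left out):

```python
import numpy as np

def find_bad_values(tensor, name="tensor"):
    """Report inf/nan counts and the first bad flat index,
    mirroring what ggml_validate_row_data checks for."""
    flat = np.asarray(tensor, dtype=np.float32).ravel()
    bad = ~np.isfinite(flat)          # True where inf or nan
    if not bad.any():
        return None
    first = int(np.argmax(bad))       # index of first non-finite element
    print(f"{name}: {np.isinf(flat).sum()} inf, "
          f"{np.isnan(flat).sum()} nan, first at {first}")
    return first
```

Running this over every tensor in the fp16 GGUF would pinpoint exactly which weights are corrupt before wasting a quantization pass.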
Hello, I don't know much about finetuning and merging. What are the benefits of merging so many models vs. just a couple? In my mind they would just overwrite each other, since the parameter count doesn't increase.
It's all theoretical and there might not be any benefit. In fact, most methods collapse with too many models. The theory behind this was to see if Karcher's "center" in this case—a collective of mostly finetunes sprinkled with a few merges—could capture some unique novelty and creativity not seen in typical smaller merges. I wanted to use model_stock initially but ran into severe tokenizer issues with it.
Karcher is different from task-vector methods like model_stock and SCE/della in that it operates on a Riemannian hypersphere. This seems more stable because it has to find a true geometric center of all N models: it doesn't get to ignore or prune vectors, it requires a holistic pass over every single weight, and that's resource-intensive at float32.
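The Karcher (Fréchet) mean the method is named after can be illustrated on the unit sphere: treat each model's flattened weights as a point, then iterate toward the point minimizing summed squared geodesic distance. A toy numpy sketch of the general idea, my own illustration rather than mergekit's actual implementation:

```python
import numpy as np

def log_map(mu, x):
    """Tangent vector at mu pointing toward x along the great circle."""
    cos_t = np.clip(np.dot(mu, x), -1.0, 1.0)
    v = x - cos_t * mu                 # component of x orthogonal to mu
    norm = np.linalg.norm(v)
    if norm < 1e-12:
        return np.zeros_like(mu)
    return v / norm * np.arccos(cos_t)  # scale by the arc angle

def exp_map(mu, v):
    """Walk from mu along tangent vector v, staying on the unit sphere."""
    t = np.linalg.norm(v)
    if t < 1e-12:
        return mu
    return np.cos(t) * mu + np.sin(t) * v / t

def karcher_mean(points, iters=100, tol=1e-10):
    """Iterate to the point minimizing squared geodesic distance to all inputs."""
    pts = [p / np.linalg.norm(p) for p in points]  # project onto the sphere
    mu = pts[0].copy()
    for _ in range(iters):
        # Average of tangent directions toward every model; zero at the center.
        tangent = np.mean([log_map(mu, p) for p in pts], axis=0)
        if np.linalg.norm(tangent) < tol:
            break
        mu = exp_map(mu, tangent)
    return mu
```

Note that the result always has the same shape as the inputs, which also answers the question above: merging never adds parameters, it relocates them.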
Other custom methods require even stronger GPUs to test but are theoretically more robust at larger scales. Some of these more complex methods are very experimental and attempt to 'go beyond the params into the space between them'. I don't have enough time or resources yet to confirm whether they actually work, but some ideas are in progress.
Unfortunately the merge is broken, and I'm not sure it can be fixed. It quantizes now, but the output has the same slop, endless repetition, and repeating tokenizer control tokens.
RAW bugged weights: https://huggingface.co/Naphula-Archives/DeepWater-Pleroma-12B-v0-raw-weights
I posted the healer script to the repo. Maybe I'll come back to this experiment later, but for now I'm testing other 12B merges.
Lesson: best to start small and work your way up. Giant merges like these are bound to fail if you don't know the details of each model well enough. 12B is way more fussy than 24B when it comes to tokenizers.
