LastBox v6 — SFT-warmup checkpoint for 72 % tool emission on RPi 5

Post-deadline iteration of lastbox-gemma4-e2b-sft-v3. A 12-minute targeted SFT pass on 1 034 tool-only pairs took the agentic score from 0.016 → 0.608 (38×) — a result two GRPO iterations couldn't reach in 3 h of GB10 time. Same 2B params, same Q4_K_M quant, same 8 GB Raspberry Pi 5 deployment.

What changed vs v3

The v3 SFT trained on full dialogs that contained tool calls but the inference-time prompt structure suppressed them. Two fixes unlocked tool emission:

  1. SFT warmup on tool-only pairs (train_v2_toolonly.jsonl, 1 034 [user, assistant_tool_call] pairs only). Sets the prior on first-turn tool emission.
  2. Eval prompt fix — the original eval was passing 733-char SYSTEM_PROMPT_EN alone; training prompts had a 4 233-char system block with the 7-tool JSON whitelist embedded. Without that block the model correctly inferred "no tool defs ⇒ don't emit tool calls". Eval now uses process_v2._build_system_prompt_with_tools().
  3. Non-streaming POST + 2-retry in eval — fixes the 13/25 streaming-SSE completion floor that was masquerading as a quality issue.

Three independent fixes, three independent metrics each unlocked.

Eval (golden_en, n=25, tool-allowed system + non-streaming POST)

Metric v3 SFT (this base) v6 (this checkpoint) Δ
tool_emission_rate ~0% 72% +72 pp
tool_accuracy 0% 64% +64 pp
arg_validity 4% 56% +52 pp
agentic_score 0.016 0.608 38×
byte_compliance 0.48 1.000 +0.52
format_ok 0.52 1.000 +0.48
persona_ok 0.52 1.000 +0.48
response_quality 0.506 1.000 +0.494
completed / 25 14 25 perfect

What we tried before this (and why it didn't work)

  • v4 GRPO (reward v1): 0.5·tool_match + 0.3·format + 0.2·byte_cap. 200 steps, num_generations=4. Reward climbed 0.5 → 0.7 in training but tool emission collapsed at eval. The reward gave equal credit to "emit correct tool" and "skip tool, give direct answer" — the no-tool path is easier to reach so GRPO converged there.
  • v5 GRPO (reward v2): tool-correct +1.0, no-tool-when-expected −0.5, spurious-tool 0.0. Stronger gradient toward tool emission. Same plateau — with KL β=0.04 the model cannot move from p(tool)≈0 to p(tool)≈1 in 200 steps; the KL term blocks the required policy shift.

The lesson: when SFT base has p(behaviour)≈0, targeted SFT to set the prior + GRPO for shaping is the right pipeline. A pure-GRPO assault on a behaviour the base never exhibits is fundamentally bounded by KL.

Files

File Size Purpose
lastbox-gemma4-e2b-v6-q4_k_m.gguf 3.2 GB Quantized inference weights
lastbox-gemma4-e2b-v6-bf16.gguf 8.7 GB bf16 source for quantization
model.safetensors 9.6 GB Merged HF format
lora/adapter_model.safetensors 49 MB LoRA adapter (on top of v3 base)
tokenizer.json + chat template + config – Required for inference

Quickstart — llama.cpp server

docker run --rm -p 11436:8080 \
  -v $(pwd):/models:ro \
  ghcr.io/ggml-org/llama.cpp:server \
  -m /models/lastbox-gemma4-e2b-v6-q4_k_m.gguf \
  --host 0.0.0.0 --port 8080 \
  --ctx-size 4096 --threads 4 --parallel 1 \
  --jinja --chat-template-file /models/chat_template.jinja \
  --no-mmap --mlock

Training recipe

base:               norecyc/lastbox-gemma4-e2b-sft-v3 (HF format)
framework:          Unsloth FastModel + TRL SFTTrainer
LoRA:               r=8, alpha=8, dropout=0, bias="none"
target modules:     language layers + attention + MLP, no vision
chat template:      gemma-4 (no-thinking)
hardware:           1× NVIDIA GB10 (Grace Blackwell, 121 GB unified, CUDA 13)
data:               1 034 tool-only pairs (norecyc/lastbox-survival-dialogues, train_v2_toolonly)
epochs:             1 (65 optimizer steps)
lr:                 2e-4, cosine, warmup 10 steps
batch / grad accum: 4 / 4 (effective 16)
training time:      12 min on GB10
final train loss:   0.018

Important: matching the training prompt at inference time

The model emits <tool_call> only when given the full training-time system prompt with the 7-tool definitions JSON block. To reproduce the 72% tool emission rate at inference:

from gemma4.scripts.process_v2 import _build_system_prompt_with_tools
system = _build_system_prompt_with_tools()
# pass `system` as the system role; do NOT use the shorter SYSTEM_PROMPT_EN alone.

Without that block, the model degrades to plain text answers (which it still does well — byte/format/persona stay at 1.000).

Dataset

norecyc/lastbox-survival-dialogues — specifically the train_v2_toolonly config.

License

Model weights are subject to the Gemma Terms of Use. Training scripts and LoRA adapter weights are Apache 2.0 (see GitHub repo).

Citation

@misc{lastbox_gemma4_v6_2026,
  title  = {LastBox v6: SFT-warmup checkpoint with 72 % tool emission on Raspberry Pi 5},
  author = {Mateusz Pawelczuk},
  year   = {2026},
  url    = {https://huggingface.co/norecyc/lastbox-gemma4-e2b-v6-toolprior}
}
Downloads last month
608
Safetensors
Model size
5B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for norecyc/lastbox-gemma4-e2b-v6-toolprior

Adapter
(1)
this model

Dataset used to train norecyc/lastbox-gemma4-e2b-v6-toolprior

Space using norecyc/lastbox-gemma4-e2b-v6-toolprior 1