@Locutusque on Hugging Face: "🚀 Introducing Esmeralda-Llama-3.1-8B-control The first release in the…"

Post

138

🚀 Introducing Esmeralda-Llama-3.1-8B-control
The first release in the Esmeralda model family by Locutusque.

This model is intentionally small and experimental — a control/baseline proof-of-concept designed to answer one question:

«“How strong is my new "Locutusque/esmeralda-agentic" dataset before scaling to larger runs?”»

Training Details

- Base: Llama 3.1 8B
- Training precision: bf16 mixed precision
- Chat template: modified ChatML
- Dataset size: ~37k examples
- Examples actually used for this run: ~5k

The dataset includes:

- multi-turn agentic traces
- reasoning traces
- structured assistant behavior
- generalist instruction data

Benchmark Results

Compared against:

- Llama 3.1 8B Instruct
- Hermes-3-Llama-3.1-8B

HumanEval

57.3 — Esmeralda
56.1 — Llama 3.1 Instruct
52.4 — Hermes-3

MBPP

53.2 — Esmeralda
56.8 — Llama 3.1 Instruct
48.2 — Hermes-3

GPQA Diamond

15.7 — Esmeralda
15.7 — Llama 3.1 Instruct
18.2 — Hermes-3

EQ-Bench

59.2 — Esmeralda
61.1 — Llama 3.1 Instruct
63.1 — Hermes-3

EQ-Bench Parseable (Syntax Stability)

🔥 100.0% — Esmeralda
92.4% — Llama 3.1 Instruct
91.2% — Hermes-3

Here Be Dragons 🐉

I also experimented with a new TruthfulQA free-generation evaluation setup.

- Responses were judged by Gemma 4 26B A4B
- The judge compared generations directly against ground-truth answers
- Models were evaluated in 8-bit quantized form to speed up inference

TruthfulQA (LLM Judge)

0.682 — Esmeralda-Llama-3.1-8B-control
0.587 — Hermes-3-Llama-3.1-8B (reported MC2 score; methodology differs)

For a lightweight control run trained on only a fraction of the dataset, I’m pretty encouraged by the results.

The model is released under the standard Llama 3.1 license, and I’d genuinely love feedback from people testing it in real workflows.

Model: Locutusque/Esmeralda-Llama-3.1-8B-control

Dataset: Locutusque/esmeralda-agentic

Join the conversation