Maracatu-80M 🥁

Open-weight Brazilian Portuguese language model trained from scratch. 87.80M total parameters (75.52M non-embedding). Apache 2.0.

Maracatu-80M is a causal language model trained from scratch on Brazilian Portuguese text. It is the second public checkpoint of the Maracatu AI project, an open effort to build Portuguese-language LLMs with full transparency over architecture, data, and training procedure.

The model was trained for 200,000 iterations on 1.60B tokens of curated PT-BR text (Wikipedia, Project Gutenberg, CulturaX-PT filtered), reaching a validation perplexity of 21.34 on a 3.27M-token holdout. At this scale, it outperforms the public Tucano-160M baseline (reported ~22) despite having roughly half the parameters.

This is a base model (next-token completion). It is not an instruction-following assistant. Fine-tune it or use it as a starting point for downstream tasks.


Architecture

Llama-style decoder-only transformer. The state_dict is compatible with transformers.LlamaForCausalLM and loads via AutoModelForCausalLM.from_pretrained without any conversion script (validated max_abs_diff=0.0 against native forward pass).

Hyperparameter Value
Total parameters 87.80M
Non-embedding parameters 75.52M
num_hidden_layers 12
hidden_size 768
num_attention_heads 12
num_key_value_heads 4 (GQA, 3:1 ratio)
intermediate_size (SwiGLU) 2048
max_position_embeddings 1024
vocab_size 16000
rope_theta 10000.0
rms_norm_eps 1e-5
Normalization RMSNorm
Positional encoding RoPE (rotate-half, HF implementation)
Activation SwiGLU
Bias in nn.Linear No
Weight tying (embed ↔ lm_head) Yes
Tokenizer SentencePiece BPE, nmt_nfkc_cf (lowercase), split_digits, byte fallback
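
The tokenizer settings in the last row map directly onto SentencePiece training flags. A minimal sketch, assuming a placeholder corpus path and model prefix (this is not the project's actual training script):

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus_ptbr.txt",                # placeholder path
    model_prefix="maracatu",                # placeholder prefix
    vocab_size=16000,
    model_type="bpe",
    normalization_rule_name="nmt_nfkc_cf",  # NFKC + case folding (everything is lowercased)
    split_digits=True,
    byte_fallback=True,
)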

The model uses Grouped Query Attention (GQA), with 12 query heads sharing 4 KV heads (3:1 ratio), the same grouped-query scheme used by the Llama 3 family. This reduces KV-cache size during inference without measurable perplexity impact at this scale.
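
For reference, the architecture above can be expressed as a transformers.LlamaConfig. This sketch is for illustration only; the published config.json on the Hub is authoritative:

from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=16000,
    hidden_size=768,
    intermediate_size=2048,
    num_hidden_layers=12,
    num_attention_heads=12,
    num_key_value_heads=4,        # GQA: 12 query heads share 4 KV heads (3:1)
    max_position_embeddings=1024,
    rope_theta=10000.0,
    rms_norm_eps=1e-5,
    hidden_act="silu",            # SwiGLU MLP
    attention_bias=False,         # no bias in nn.Linear
    tie_word_embeddings=True,     # embed_tokens and lm_head share weights
)

model = LlamaForCausalLM(config)
total = sum(p.numel() for p in model.parameters())
non_embed = total - model.model.embed_tokens.weight.numel()
print(f"total={total/1e6:.2f}M non-embedding={non_embed/1e6:.2f}M")  # should match 87.80M / 75.52M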


Training Data

Corpus v2 — 1.60B tokens total. SHA-256: a1000e873bfcae0d2229ecc9b329f0befe8ad73913e79e58f14a1f3a48ef7e58

Source License Approx. tokens Notes
Wikipedia PT (20231101.pt) CC BY-SA 3.0 ~550M HF dataset wikimedia/wikipedia
Project Gutenberg PT Public Domain ~150M 24 curated works (Machado de Assis, José de Alencar, Aluísio Azevedo, Eça de Queirós, Graciliano Ramos, Lima Barreto, Monteiro Lobato, Olavo Bilac, Castro Alves, others)
CulturaX-PT filtered ODC-BY 1.0 ~900M 1.49M documents after filtering

Filtering pipeline: MinHash LSH deduplication (Jaccard threshold 0.85), PII removal via regex (Brazilian CPF, CEP, phone, and email patterns), a language-identification heuristic filter, and byte-level deduplication.
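
For reference, the MinHash LSH step can be approximated with the datasketch library. This is a minimal sketch, assuming word 5-gram shingles and num_perm=128; only the 0.85 Jaccard threshold comes from the description above:

from datasketch import MinHash, MinHashLSH

def signature(text, num_perm=128):
    # MinHash over word 5-gram shingles (shingle size is an illustrative choice)
    words = text.lower().split()
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(words) - 4, 1)):
        m.update(" ".join(words[i:i + 5]).encode("utf-8"))
    return m

def dedup(documents, threshold=0.85):
    # Keep a document only if no previously kept document is a near-duplicate
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for idx, text in enumerate(documents):
        sig = signature(text)
        if lsh.query(sig):            # at least one near-duplicate was already kept
            continue
        lsh.insert(str(idx), sig)
        kept.append(text)
    return kept

docs = [
    "o maracatu é uma manifestação cultural de pernambuco",
    "o maracatu é uma manifestação cultural de pernambuco",   # exact duplicate
    "a capoeira é uma expressão cultural afro-brasileira",
]
print(len(dedup(docs)))  # 2 -- the duplicate is dropped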

Excluded by design: raw Common Crawl and CC BY-NC sources (the Carolina corpus was left out to preserve commercial flexibility). No uncurated web-crawl data is included.

Chinchilla ratio: 1.60B tokens / 75.52M non-embedding params ≈ 21.2 tokens/param (near the compute-optimal estimate of ~20 from Hoffmann et al., 2022).


Training Procedure

Hyperparameters

Item Value
Framework PyTorch
Precision bf16 autocast (forward in bf16; weights and optimizer states in fp32)
Optimizer AdamW
Peak learning rate 2.5e-4
Minimum learning rate 2.5e-5
LR schedule Cosine decay with linear warmup
Warmup iterations 4,000
Total iterations 200,000
Batch size 8
Context length 1,024 tokens
Gradient accumulation None
Gradient clipping 1.0
AdamW betas β₁=0.9, β₂=0.95
Weight decay 0.1
Tokens seen ~1.64B
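
A minimal sketch of the learning-rate schedule implied by the table (a plain reimplementation, not the project's training code):

import math

WARMUP, MAX_ITERS = 4_000, 200_000
PEAK_LR, MIN_LR = 2.5e-4, 2.5e-5

def lr_at(it):
    # Linear warmup to PEAK_LR, then cosine decay to MIN_LR over the remaining iterations
    if it < WARMUP:
        return PEAK_LR * (it + 1) / WARMUP
    progress = (it - WARMUP) / (MAX_ITERS - WARMUP)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

print(lr_at(0), lr_at(WARMUP), lr_at(MAX_ITERS))  # ~0, 2.5e-4, 2.5e-5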

Hardware

Item Value
GPU NVIDIA RTX 3060 12GB VRAM
Setup Single GPU, self-hosted
Training time 22h 31min (continuous)
Throughput ~20,200 tok/s (stable throughout)
Memory leaks None observed over 22.5h
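
The token count and throughput are consistent with the hyperparameters above; a quick arithmetic check:

tokens = 8 * 1024 * 200_000                    # batch size × context length × iterations
wall_clock_s = 22 * 3600 + 31 * 60             # 22h 31min
print(f"{tokens / 1e9:.2f}B tokens seen")      # 1.64B
print(f"{tokens / wall_clock_s:,.0f} tok/s")   # ≈ 20,200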

Evaluation

Validation Perplexity

Evaluated on a 3.27M-token holdout (the last chronological segment of the corpus, not seen during training).

Metric Value Step
Best validation loss (during training) 3.0163 ~190,000
Final validation loss 3.0604 200,000
Final validation perplexity 21.34 200,000
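
Perplexity here is the exponential of the mean next-token cross-entropy loss, so the table values follow directly:

import math

print(math.exp(3.0604))  # ≈ 21.34 -- final validation perplexity
print(math.exp(3.0163))  # ≈ 20.42 -- perplexity at the best checkpoint (~step 190,000)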

Comparison with Public Baselines

Perplexity comparison on held-out PT-BR text. Numbers reported on the same metric (next-token perplexity); vocabulary differences mean these are indicative, not formally controlled benchmarks.

Model Params Val Perplexity Source
Tucano-160M 160M ~22 (reported) Corrêa et al., 2024
Maracatu-80M (this release) 87.80M 21.34 This card

Maracatu-80M reaches lower perplexity than Tucano-160M with approximately half the parameters. The primary contributing factors are the expanded, deduplicated 1.60B-token corpus and GQA attention. A formally controlled comparison (same holdout split, same vocabulary) is planned for the Maracatu-800M paper submission.

Downstream Benchmarks

Evaluated with lm-evaluation-harness v0.4.11 (zero-shot, default prompts). Results below are honest single-run numbers without cherry-picking.

Task Metric Score Stderr Random baseline
ENEM Challenge (1432 questions) acc 20.27% ±1.06% 20% (5-MCQ)
ASSIN Entailment acc 29.08% ±0.72% ~33% (3-class)
ASSIN Paraphrase acc 52.42% ±0.79% 50% (binary)
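
These runs can be approximately reproduced through the harness's Python API. A sketch; the task identifiers below are assumptions, so check lm_eval --tasks list for the exact ENEM/ASSIN names in your install (the card's numbers come from v0.4.11 with default prompts):

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=maracatu-ai/maracatu-80m",
    tasks=["enem_challenge", "assin_rte", "assin_paraphrase"],  # illustrative task names
    num_fewshot=0,
)
for task, metrics in results["results"].items():
    print(task, metrics)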

Honest interpretation:

  • ENEM (20.27%): indistinguishable from random chance. Expected behavior for a base model at this scale: ENEM requires multi-step reasoning over factual content, and 80M parameters without instruction tuning are insufficient. This number will improve with the 800M release and instruction tuning, not with more pretraining data alone.

  • ASSIN Entailment (29.08%): slightly below the 3-class random baseline of ~33%, suggesting a systematic bias in option selection rather than meaningful task understanding. The model is not extracting useful entailment signal from this prompt format.

  • ASSIN Paraphrase (52.42%): ~3 standard errors above the 50% binary random baseline. This is statistically significant but modest, and likely reflects surface-level lexical overlap detection rather than semantic paraphrase reasoning.

Comparison with Maracatu-20M: the published Maracatu-20M model card reports 60.52% on ASSIN Paraphrase. The 80M's 52.42% is lower despite the model being larger and having lower validation perplexity. Possible explanations: (1) the 20M evaluation used a different lm-evaluation-harness version with different prompt normalization; (2) MCQ accuracy in small base models has high variance, because models with a weak signal-to-noise ratio often effectively guess; (3) pretraining quality improvements (perplexity 23.81 → 21.34) translate into fluency, not necessarily into discriminative MCQ scores. A controlled re-evaluation of both models on the same harness version is planned for the 800M paper. The 80M's drop here is reported as-is, without smoothing.

What the 80M does well vs. what it does not: this model is good at producing fluent Brazilian Portuguese text (reflected in lower validation perplexity). It is not good at multiple-choice reasoning over factual or semantic tasks. For deployment, treat it as a generation backbone, not as a question-answering system.


Qualitative Samples

Generated with temperature=0.8, top_k=50, repetition_penalty=1.1, seed 123. Note that the tokenizer normalizes all text to lowercase.

Prompt: "machado de assis nasceu no rio de janeiro"

machado de assis nasceu no rio de janeiro. estudou na faculdade de direito da universidade federal do rio de janeiro (ufrj). participou das comissões técnicas com a experiência de seu trabalho e da comissão de ética de seus atos, em 1995. atuou também como advogado em diversos processos de processo, inclusive na ação penal. exerceu importante atuação na defesa dos direitos humanos e no ministério público.

(Note: factually incorrect biographical content. The model produces fluent prose but cannot retrieve verified facts. Expected behavior at 80M scale.)

Prompt: "o brasil e um pais de dimensoes continentais"

o brasil e um pais de dimensoes continentais, assim sendo de um lado o da américa latina, que se tem feito a visita dos mesmos em suas diferentes escalas, de onde se pode observar a maior parte do movimento e se deslocando para o brasil.

Prompt: "a inteligencia artificial pode"

a inteligencia artificial pode ser considerada uma coisa, e sim, um "bem". mas a coisa que mais se sabe sobre a inteligência artificial é que ela pode ter sido usada para prever as mudanças do ambiente, ou se esta não era a primeira. se fosse um "bem", poderíamos afirmar isso.

Prompt: "em uma manha de domingo, joao caminhava"

em uma manha de domingo, joao caminhava pelo centro da cidade, até um carro da polícia federal na região. quando o policial chegou não sabe o que aconteceu e acabou pegando a arma para ser removida.

These samples illustrate that Maracatu-80M produces fluent Brazilian Portuguese with reasonable grammar and topical coherence, but is unreliable for factual content. This is expected behavior for a base model at this scale.


How to Use

HuggingFace transformers

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("maracatu-ai/maracatu-80m", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("maracatu-ai/maracatu-80m")
model.eval()

inputs = tokenizer("O Brasil é", return_tensors="pt")
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.8,
        top_k=50,
        do_sample=True,
    )
print(tokenizer.decode(out[0], skip_special_tokens=True))

Note: The tokenizer normalizes all input to lowercase (nmt_nfkc_cf). Pass text in any case; it will be folded internally.
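
A quick way to see the case folding, reusing the tokenizer loaded above:

ids = tokenizer("O Brasil É um País GRANDE", return_tensors="pt")["input_ids"]
print(tokenizer.decode(ids[0], skip_special_tokens=True))
# expected: "o brasil é um país grande" -- folding happens at encode time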

Ollama (local inference)

ollama run whereisanzi/maracatu-80m "O Brasil é"

PyTorch (native checkpoint)

import torch
import sentencepiece as spm
from maracatu.model import Maracatu, ModelConfig

ckpt = torch.load("best.pt", map_location="cpu", weights_only=False)
model = Maracatu(ModelConfig(**ckpt["model_config"]))
# strip the "_orig_mod." prefix that torch.compile adds to parameter names
model.load_state_dict({k.removeprefix("_orig_mod."): v for k, v in ckpt["model_state"].items()})
model.eval()

sp = spm.SentencePieceProcessor(model_file="maracatu.model")
prompt_ids = torch.tensor([sp.encode("O Brasil é")])
out = model.generate(prompt_ids, max_new_tokens=100, temperature=0.8, top_k=50)
print(sp.decode(out[0].tolist()))

Quantizations

Available via Ollama (whereisanzi/maracatu-80m):

Format File size Notes
fp16 168 MB Full precision inference
Q8_0 90 MB Near-lossless, recommended for evaluation
Q5_K_M 61 MB Good quality/size balance
Q4_K_M 54 MB Default Ollama variant

Limitations

  • Scale: 87.80M parameters is small by 2026 standards. Factual retrieval is unreliable; hallucination is expected and frequent. Do not use for factual question answering without retrieval augmentation.
  • Lowercase only: The tokenizer normalizes all text to lowercase. The model does not produce uppercase output.
  • No instruction tuning: This is a base model. It completes text; it is not a chat assistant. Fine-tune before deploying in any interactive setting.
  • Context window: 1,024 tokens. Longer inputs are truncated.
  • Corpus biases: Training data reflects the coverage and perspective biases of Wikipedia, Project Gutenberg, and CulturaX-PT. Topics underrepresented in those sources will produce lower-quality output. The Gutenberg subset skews toward 19th-century Brazilian and Portuguese literary Portuguese.
  • No safety fine-tuning: This model has not been evaluated for safety and may produce biased, incorrect, or harmful content. It is intended for research use.
  • Not state-of-the-art in absolute terms: Frontier PT-BR models (e.g., closed Sabiá-3) report perplexities well below 10. The contribution of this release is methodology and reproducibility at a compute-accessible scale, not absolute performance.
  • Perplexity comparison caveat: The comparison with Tucano-160M is indicative. Vocabulary size and holdout composition differ; a formally controlled comparison is planned.

License

Code and weights are released under the Apache License 2.0.

Training data licenses: Wikipedia PT under CC BY-SA 3.0, Project Gutenberg PT in the public domain, and CulturaX-PT under ODC-BY 1.0 (see the Training Data table above).


Citation

@misc{anzileiro2026maracatu80m,
  author    = {Anzileiro, Anderson},
  title     = {Maracatu-80M: An Open-Weight Brazilian Portuguese Language Model},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/maracatu-ai/maracatu-80m},
}

Acknowledgments


Roadmap

The Maracatu scaling ladder releases open-weight checkpoints at increasing scale, each trained from scratch on an expanding PT-BR corpus.

Release Parameters Status
Maracatu-20M 17M Released April 2026
Maracatu-80M 87.80M This release, April 2026
Maracatu-800M ~800M Planned H2 2026; paper submission to STIL/BRACIS/PROPOR
Maracatu-8B ~8B Planned 2027
Maracatu-80B ~80B North Star; target: match Llama-3.1-70B on PT-BR benchmarks

Source code: github.com/maracatu-ai/maracatu
