Maracatu-80M 🥁
Open-weight Brazilian Portuguese language model trained from scratch. 87.80M total parameters (75.52M non-embedding). Apache 2.0.
Maracatu-80M is a causal language model trained from scratch on Brazilian Portuguese text. It is the second public checkpoint of the Maracatu AI project, an open effort to build Portuguese-language LLMs with full transparency over architecture, data, and training procedure.
The model was trained for 200,000 iterations on 1.60B tokens of curated PT-BR text (Wikipedia, Project Gutenberg, CulturaX-PT filtered), reaching a validation perplexity of 21.34 on a 3.27M-token holdout. At this scale, it outperforms the public Tucano-160M baseline (reported ~22) despite having roughly half the parameters.
This is a base model (next-token completion). It is not an instruction-following assistant. Fine-tune it or use it as a starting point for downstream tasks.
Architecture
Llama-style decoder-only transformer. The state_dict is compatible with transformers.LlamaForCausalLM and loads via AutoModelForCausalLM.from_pretrained without any conversion script (validated max_abs_diff=0.0 against native forward pass).
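The parity check mentioned above can be reproduced along these lines (a minimal sketch, not the project's actual validation script; it assumes the native maracatu package and checkpoint shown under "How to Use" below, and that the native forward returns logits directly):

import torch
from transformers import AutoModelForCausalLM
from maracatu.model import Maracatu, ModelConfig  # native implementation, see "How to Use" below

hf_model = AutoModelForCausalLM.from_pretrained("maracatu-ai/maracatu-80m").eval()
ckpt = torch.load("best.pt", map_location="cpu", weights_only=False)
native = Maracatu(ModelConfig(**ckpt["model_config"]))
native.load_state_dict({k.removeprefix("_orig_mod."): v for k, v in ckpt["model_state"].items()})
native.eval()

ids = torch.randint(0, 16000, (1, 128))  # arbitrary token sequence
with torch.no_grad():
    hf_logits = hf_model(ids).logits
    native_logits = native(ids)  # assumed to return logits directly
print("max_abs_diff:", (hf_logits - native_logits).abs().max().item())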
| Hyperparameter | Value |
|---|---|
| Total parameters | 87.80M |
| Non-embedding parameters | 75.52M |
| `num_hidden_layers` | 12 |
| `hidden_size` | 768 |
| `num_attention_heads` | 12 |
| `num_key_value_heads` | 4 (GQA, 3:1 ratio) |
| `intermediate_size` (SwiGLU) | 2048 |
| `max_position_embeddings` | 1024 |
| `vocab_size` | 16000 |
| `rope_theta` | 10000.0 |
| `rms_norm_eps` | 1e-5 |
| Normalization | RMSNorm |
| Positional encoding | RoPE (rotate-half, HF implementation) |
| Activation | SwiGLU |
| Bias in `nn.Linear` | No |
| Weight tying (embed ↔ lm_head) | Yes |
| Tokenizer | SentencePiece BPE, nmt_nfkc_cf (lowercase), split_digits, byte fallback |
The model uses Grouped Query Attention (GQA), with 12 query heads sharing 4 KV heads (3:1 ratio), following the GQA design used in the Llama 3 family. This reduces KV-cache size during inference without measurable perplexity impact at this scale.
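The parameter counts in the table can be cross-checked by instantiating an equivalent HF config (a sketch; it assumes the published config matches the hyperparameters listed above):

from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=16000,
    hidden_size=768,
    intermediate_size=2048,
    num_hidden_layers=12,
    num_attention_heads=12,
    num_key_value_heads=4,        # GQA, 3:1 ratio
    max_position_embeddings=1024,
    rope_theta=10000.0,
    rms_norm_eps=1e-5,
    tie_word_embeddings=True,     # embed <-> lm_head weight tying
    attention_bias=False,         # no bias in nn.Linear
)
model = LlamaForCausalLM(config)
total = sum(p.numel() for p in model.parameters())
embedding = model.model.embed_tokens.weight.numel()
print(f"total: {total / 1e6:.2f}M, non-embedding: {(total - embedding) / 1e6:.2f}M")
# expected: total ≈ 87.80M, non-embedding ≈ 75.52M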
Training Data
Corpus v2 — 1.60B tokens total. SHA-256: a1000e873bfcae0d2229ecc9b329f0befe8ad73913e79e58f14a1f3a48ef7e58
| Source | License | Approx. tokens | Notes |
|---|---|---|---|
| Wikipedia PT (20231101.pt) | CC BY-SA 3.0 | ~550M | HF dataset `wikimedia/wikipedia` |
| Project Gutenberg PT | Public Domain | ~150M | 24 curated works (Machado de Assis, José de Alencar, Aluísio Azevedo, Eça de Queirós, Graciliano Ramos, Lima Barreto, Monteiro Lobato, Olavo Bilac, Castro Alves, others) |
| CulturaX-PT filtered | ODC-BY 1.0 | ~900M | 1.49M documents after filtering |
Filtering pipeline: MinHash LSH deduplication (Jaccard threshold 0.85), PII regex removal (CPF, email, CEP, phone patterns BR), language heuristic filter, byte-level deduplication.
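A minimal sketch of the two most mechanical stages, near-duplicate detection and PII masking, under the thresholds listed above (the datasketch usage and regex patterns are illustrative, not the project's actual pipeline code):

import re
from datasketch import MinHash, MinHashLSH

# near-duplicate detection at Jaccard threshold 0.85
lsh = MinHashLSH(threshold=0.85, num_perm=128)

def minhash(text):
    m = MinHash(num_perm=128)
    for token in text.lower().split():
        m.update(token.encode("utf-8"))
    return m

def is_duplicate(doc_id, text):
    m = minhash(text)
    if lsh.query(m):  # any previously seen document above the threshold?
        return True
    lsh.insert(doc_id, m)
    return False

# illustrative Brazilian PII patterns (CPF, CEP, e-mail, phone)
PII_PATTERNS = [
    re.compile(r"\b\d{3}\.\d{3}\.\d{3}-\d{2}\b"),   # CPF
    re.compile(r"\b\d{5}-\d{3}\b"),                  # CEP
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),     # e-mail
    re.compile(r"\(?\d{2}\)?\s?9?\d{4}-\d{4}"),      # BR phone

]

def scrub_pii(text):
    for pattern in PII_PATTERNS:
        text = pattern.sub("<pii>", text)
    return text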
Excluded by design: raw Common Crawl and CC BY-NC sources (the Carolina corpus was excluded to preserve commercial flexibility). No Maracatu corpus to date includes uncurated web-crawl data.
Chinchilla ratio: 1.60B tokens / 75.52M non-embedding params ≈ 21.2 tokens/param (near the compute-optimal estimate of ~20 from Hoffmann et al., 2022).
Training Procedure
Hyperparameters
| Item | Value |
|---|---|
| Framework | PyTorch |
| Precision | bf16 autocast (forward in bf16; weights and optimizer states in fp32) |
| Optimizer | AdamW |
| Peak learning rate | 2.5e-4 |
| Minimum learning rate | 2.5e-5 |
| LR schedule | Cosine decay with linear warmup |
| Warmup iterations | 4,000 |
| Total iterations | 200,000 |
| Batch size | 8 |
| Context length | 1,024 tokens |
| Gradient accumulation | None |
| Gradient clipping | 1.0 |
| AdamW betas | β₁=0.9, β₂=0.95 |
| Weight decay | 0.1 |
| Tokens seen | ~1.64B |
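The settings above translate into a short optimizer and schedule setup (a sketch in nanoGPT style; it assumes an HF-style forward that returns a .loss, and the actual loop in the source repository may differ in detail):

import math
import torch

def make_optimizer(model):
    # AdamW with the betas and weight decay listed above; weights stay in fp32
    return torch.optim.AdamW(model.parameters(), lr=2.5e-4, betas=(0.9, 0.95), weight_decay=0.1)

def lr_at(it, max_lr=2.5e-4, min_lr=2.5e-5, warmup=4000, total=200000):
    # linear warmup to max_lr, then cosine decay to min_lr
    if it < warmup:
        return max_lr * (it + 1) / warmup
    progress = (it - warmup) / (total - warmup)
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def train_step(model, optimizer, input_ids, labels, it):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(it)
    # forward in bf16 autocast; weights and optimizer states remain fp32
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = model(input_ids, labels=labels).loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # gradient clipping at 1.0
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return loss.item()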
Hardware
| Item | Value |
|---|---|
| GPU | NVIDIA RTX 3060 12GB VRAM |
| Setup | Single GPU, self-hosted |
| Training time | 22h 31min (continuous) |
| Throughput | ~20,200 tok/s (stable throughout) |
| Memory leaks | None observed over 22.5h |
Evaluation
Validation Perplexity
Evaluated on a 3.27M-token holdout (0.5% of corpus, last chronological segment, not seen during training).
| Metric | Value | Step |
|---|---|---|
| Best validation loss (during training) | 3.0163 | ~190,000 |
| Final validation loss | 3.0604 | 200,000 |
| Final validation perplexity | 21.34 | 200,000 |
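Perplexity here is the exponential of the mean next-token cross-entropy on the holdout, so the last two rows are consistent:

import math
print(math.exp(3.0604))  # ≈ 21.34, the final validation perplexity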
Comparison with Public Baselines
Perplexity comparison on held-out PT-BR text. Numbers reported on the same metric (next-token perplexity); vocabulary differences mean these are indicative, not formally controlled benchmarks.
| Model | Params | Val Perplexity | Source |
|---|---|---|---|
| Tucano-160M | 160M | ~22 (reported) | Correa et al., 2024 |
| Maracatu-80M (this release) | 87.80M | 21.34 | This card |
Maracatu-80M reaches lower perplexity than Tucano-160M with approximately half the parameters. The primary contributing factor is the expanded and deduplicated corpus (1.60B tokens, versus the Wikipedia-only corpus of the previous Maracatu release); GQA mainly reduces inference cost rather than perplexity (see Architecture). A formally controlled comparison (same holdout split, same vocabulary) is planned for the Maracatu-800M paper submission.
Downstream Benchmarks
Evaluated with lm-evaluation-harness v0.4.11 (zero-shot, default prompts). Results below are honest single-run numbers without cherry-picking.
| Task | Metric | Score | Stderr | Random baseline |
|---|---|---|---|---|
| ENEM Challenge (1432 questions) | acc | 20.27% | ±1.06% | 20% (5-MCQ) |
| ASSIN Entailment | acc | 29.08% | ±0.72% | ~33% (3-class) |
| ASSIN Paraphrase | acc | 52.42% | ±0.79% | 50% (binary) |
Honest interpretation:
ENEM (20.27%): indistinguishable from random chance. This is expected behavior for a base model at this scale: ENEM requires multi-step reasoning over factual content, and 80M parameters without instruction tuning is insufficient. This number will improve with the 800M release and instruction tuning, not with more pretraining data alone.
ASSIN Entailment (29.08%): slightly below the 3-class random baseline of ~33%, suggesting a systematic bias in option selection rather than meaningful task understanding. The model is not extracting useful entailment signal from this prompt format.
ASSIN Paraphrase (52.42%): ~3 standard errors above the 50% binary random baseline. This is statistically significant but modest, and likely reflects surface-level lexical overlap detection rather than semantic paraphrase reasoning.
Comparison with Maracatu-20M: the published Maracatu-20M model card reports 60.52% on ASSIN Paraphrase. The 80M's 52.42% is lower despite the model being larger and having lower validation perplexity. Possible explanations: (1) the 20M evaluation used a different lm-evaluation-harness version with different prompt normalization; (2) MCQ accuracy in small base models has high variance, since models with a weak task signal are effectively guessing; (3) pretraining quality improvements (perplexity 23.81 → 21.34) translate into fluency, not necessarily into discriminative MCQ scores. A controlled re-evaluation of both models on the same harness version is planned for the 800M paper. The 80M's drop here is reported as-is, without smoothing.
What the 80M does well vs. what it does not: this model is good at producing fluent Brazilian Portuguese text (reflected in lower validation perplexity). It is not good at multiple-choice reasoning over factual or semantic tasks. For deployment, treat it as a generation backbone, not as a question-answering system.
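For reproduction, the harness can be driven from Python through its simple_evaluate entry point (a sketch only: the task identifiers below are placeholders, since the exact PT-BR task names depend on the harness version or fork installed):

import lm_eval

# task names below are placeholders; substitute the ENEM/ASSIN task identifiers
# available in your lm-evaluation-harness installation
results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=maracatu-ai/maracatu-80m",
    tasks=["enem_challenge", "assin_entailment", "assin_paraphrase"],
    num_fewshot=0,
)
print(results["results"])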
Qualitative Samples
Generated with temperature=0.8, top_k=50, repetition_penalty=1.1, seed 123. Note that the tokenizer normalizes all text to lowercase.
Prompt: "machado de assis nasceu no rio de janeiro"
machado de assis nasceu no rio de janeiro. estudou na faculdade de direito da universidade federal do rio de janeiro (ufrj). participou das comissões técnicas com a experiência de seu trabalho e da comissão de ética de seus atos, em 1995. atuou também como advogado em diversos processos de processo, inclusive na ação penal. exerceu importante atuação na defesa dos direitos humanos e no ministério público.
(Note: factually incorrect biographical content. The model produces fluent prose but cannot retrieve verified facts. Expected behavior at 80M scale.)
Prompt: "o brasil e um pais de dimensoes continentais"
o brasil e um pais de dimensoes continentais, assim sendo de um lado o da américa latina, que se tem feito a visita dos mesmos em suas diferentes escalas, de onde se pode observar a maior parte do movimento e se deslocando para o brasil.
Prompt: "a inteligencia artificial pode"
a inteligencia artificial pode ser considerada uma coisa, e sim, um "bem". mas a coisa que mais se sabe sobre a inteligência artificial é que ela pode ter sido usada para prever as mudanças do ambiente, ou se esta não era a primeira. se fosse um "bem", poderíamos afirmar isso.
Prompt: "em uma manha de domingo, joao caminhava"
em uma manha de domingo, joao caminhava pelo centro da cidade, até um carro da polícia federal na região. quando o policial chegou não sabe o que aconteceu e acabou pegando a arma para ser removida.
These samples illustrate that Maracatu-80M produces fluent Brazilian Portuguese with reasonable grammar and topical coherence, but is unreliable for factual content. This is expected behavior for a base model at this scale.
How to Use
HuggingFace transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
tokenizer = AutoTokenizer.from_pretrained("maracatu-ai/maracatu-80m", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("maracatu-ai/maracatu-80m")
model.eval()
inputs = tokenizer("O Brasil é", return_tensors="pt")
with torch.no_grad():
out = model.generate(
**inputs,
max_new_tokens=100,
temperature=0.8,
top_k=50,
do_sample=True,
)
print(tokenizer.decode(out[0], skip_special_tokens=True))
Note: The tokenizer normalizes all input to lowercase (nmt_nfkc_cf). Pass text in any case; it will be folded internally.
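Continuing the snippet above, a quick way to see the case folding (the expected output is shown as a comment, based on the nmt_nfkc_cf normalization):

ids = tokenizer("Machado de Assis", return_tensors="pt").input_ids
print(tokenizer.decode(ids[0], skip_special_tokens=True))
# expected output: "machado de assis" (case is folded before encoding)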
Ollama (local inference)
ollama run whereisanzi/maracatu-80m "O Brasil é"
PyTorch (native checkpoint)
import torch
import sentencepiece as spm
from maracatu.model import Maracatu, ModelConfig
# the checkpoint stores the model config alongside the weights
ckpt = torch.load("best.pt", map_location="cpu", weights_only=False)
model = Maracatu(ModelConfig(**ckpt["model_config"]))
# strip the "_orig_mod." prefix that torch.compile adds to state_dict keys
model.load_state_dict({k.removeprefix("_orig_mod."): v for k, v in ckpt["model_state"].items()})
model.eval()
sp = spm.SentencePieceProcessor(model_file="maracatu.model")
prompt_ids = torch.tensor([sp.encode("O Brasil é")])
out = model.generate(prompt_ids, max_new_tokens=100, temperature=0.8, top_k=50)
print(sp.decode(out[0].tolist()))
Quantizations
Available via Ollama (whereisanzi/maracatu-80m):
| Format | File size | Notes |
|---|---|---|
| fp16 | 168 MB | Full precision inference |
| Q8_0 | 90 MB | Near-lossless, recommended for evaluation |
| Q5_K_M | 61 MB | Good quality/size balance |
| Q4_K_M | 54 MB | Default Ollama variant |
Limitations
- Scale: 87.80M parameters is small by 2026 standards. Factual retrieval is unreliable; hallucination is expected and frequent. Do not use for factual question answering without retrieval augmentation.
- Lowercase only: The tokenizer normalizes all text to lowercase. The model does not produce uppercase output.
- No instruction tuning: This is a base model. It completes text; it is not a chat assistant. Fine-tune before deploying in any interactive setting.
- Context window: 1,024 tokens. Longer inputs are truncated.
- Corpus biases: Training data reflects the coverage and perspective biases of Wikipedia, Project Gutenberg, and CulturaX-PT. Topics underrepresented in those sources will produce lower-quality output. The Gutenberg subset skews toward 19th-century Brazilian and Portuguese literary Portuguese.
- No safety fine-tuning: This model has not been evaluated for safety and may produce biased, incorrect, or harmful content. It is intended for research use.
- Not state-of-the-art in absolute terms: Frontier PT-BR models (e.g., closed Sabiá-3) report perplexities well below 10. The contribution of this release is methodology and reproducibility at a compute-accessible scale, not absolute performance.
- Perplexity comparison caveat: The comparison with Tucano-160M is indicative. Vocabulary size and holdout composition differ; a formally controlled comparison is planned.
License
Code and weights are released under the Apache License 2.0.
Training data licenses:
- Wikipedia PT: CC BY-SA 3.0
- Project Gutenberg works: Public Domain
- CulturaX-PT: ODC-BY 1.0
Citation
@misc{anzileiro2026maracatu80m,
author = {Anzileiro, Anderson},
title = {Maracatu-80M: An Open-Weight Brazilian Portuguese Language Model},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/maracatu-ai/maracatu-80m},
}
Acknowledgments
- Andrej Karpathy — nanoGPT, pedagogical foundation for the training loop.
- Maritaca AI — Sabiá paper, reference for Portuguese LLM design decisions.
- WideLabs (Unicamp) — Tucano paper, primary public baseline for PT-BR model comparison.
- MCTI/PBIA — Plano Brasileiro de Inteligência Artificial, policy context for open PT-BR LLM development.
Roadmap
The Maracatu ladder releases open-weight checkpoints at increasing scale, each trained from scratch on expanding PT-BR corpora.
| Release | Parameters | Status |
|---|---|---|
| Maracatu-20M | 17M | Released April 2026 |
| Maracatu-80M | 87.80M | This release, April 2026 |
| Maracatu-800M | ~800M | Planned H2 2026; paper submission to STIL/BRACIS/PROPOR |
| Maracatu-8B | ~8B | Planned 2027 |
| Maracatu-80B | ~80B | North Star; target: match Llama-3.1-70B on PT-BR benchmarks |
Source code: github.com/maracatu-ai/maracatu