Maracatu-80M 🥁

Open-weight Brazilian Portuguese language model trained from scratch. 87.80M total parameters (75.52M non-embedding). Apache 2.0.

Maracatu-80M is a causal language model trained from scratch on Brazilian Portuguese text. It is the second public checkpoint of the Maracatu AI project, an open effort to build Portuguese-language LLMs with full transparency over architecture, data, and training procedure.

The model was trained for 200,000 iterations on 1.60B tokens of curated PT-BR text (Wikipedia, Project Gutenberg, CulturaX-PT filtered), reaching a validation perplexity of 21.34 on a 3.27M-token holdout. At this scale, it outperforms the public Tucano-160M baseline (reported ~22) despite having roughly half the parameters.

This is a base model (next-token completion). It is not an instruction-following assistant. Fine-tune it or use it as a starting point for downstream tasks.


Architecture

Llama-style decoder-only transformer. The state_dict is compatible with transformers.LlamaForCausalLM and loads via AutoModelForCausalLM.from_pretrained without any conversion script (validated max_abs_diff=0.0 against native forward pass).

Hyperparameter Value
Total parameters 87.80M
Non-embedding parameters 75.52M
num_hidden_layers 12
hidden_size 768
num_attention_heads 12
num_key_value_heads 4 (GQA, 3:1 ratio)
intermediate_size (SwiGLU) 2048
max_position_embeddings 1024
vocab_size 16000
rope_theta 10000.0
rms_norm_eps 1e-5
Normalization RMSNorm
Positional encoding RoPE (rotate-half, HF implementation)
Activation SwiGLU
Bias in nn.Linear No
Weight tying (embed ↔ lm_head) Yes
Tokenizer SentencePiece BPE, nmt_nfkc_cf (lowercase), split_digits, byte fallback
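
The tokenizer settings in the last row map directly onto SentencePiece training flags. A minimal sketch, assuming a placeholder corpus path and model prefix (this is not the project's actual training script):

import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="corpus_ptbr.txt",                # placeholder path
    model_prefix="maracatu",                # placeholder prefix
    vocab_size=16000,
    model_type="bpe",
    normalization_rule_name="nmt_nfkc_cf",  # NFKC + case folding (everything is lowercased)
    split_digits=True,
    byte_fallback=True,
)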

The model uses Grouped Query Attention (GQA), with 12 query heads sharing 4 KV heads (3:1 ratio), the same grouped-query scheme used by the Llama 3 family. This reduces KV-cache size during inference without measurable perplexity impact at this scale.
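
For reference, the architecture above can be expressed as a transformers.LlamaConfig. This sketch is for illustration only; the published config.json on the Hub is authoritative:

from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=16000,
    hidden_size=768,
    intermediate_size=2048,
    num_hidden_layers=12,
    num_attention_heads=12,
    num_key_value_heads=4,        # GQA: 12 query heads share 4 KV heads (3:1)
    max_position_embeddings=1024,
    rope_theta=10000.0,
    rms_norm_eps=1e-5,
    hidden_act="silu",            # SwiGLU MLP
    attention_bias=False,         # no bias in nn.Linear
    tie_word_embeddings=True,     # embed_tokens and lm_head share weights
)

model = LlamaForCausalLM(config)
total = sum(p.numel() for p in model.parameters())
non_embed = total - model.model.embed_tokens.weight.numel()
print(f"total={total/1e6:.2f}M non-embedding={non_embed/1e6:.2f}M")  # should match 87.80M / 75.52M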


Training Data

Corpus v2 — 1.60B tokens total. SHA-256: a1000e873bfcae0d2229ecc9b329f0befe8ad73913e79e58f14a1f3a48ef7e58

Source License Approx. tokens Notes
Wikipedia PT (20231101.pt) CC BY-SA 3.0 ~550M HF dataset wikimedia/wikipedia
Project Gutenberg PT Public Domain ~150M 24 curated works (Machado de Assis, José de Alencar, Aluísio Azevedo, Eça de Queirós, Graciliano Ramos, Lima Barreto, Monteiro Lobato, Olavo Bilac, Castro Alves, others)
CulturaX-PT filtered ODC-BY 1.0 ~900M 1.49M documents after filtering

Filtering pipeline: MinHash LSH deduplication (Jaccard threshold 0.85), PII removal via regex (Brazilian CPF, CEP, phone, and email patterns), a language-identification heuristic filter, and byte-level deduplication.
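
For reference, the MinHash LSH step can be approximated with the datasketch library. This is a minimal sketch, assuming word 5-gram shingles and num_perm=128; only the 0.85 Jaccard threshold comes from the description above:

from datasketch import MinHash, MinHashLSH

def signature(text, num_perm=128):
    # MinHash over word 5-gram shingles (shingle size is an illustrative choice)
    words = text.lower().split()
    m = MinHash(num_perm=num_perm)
    for i in range(max(len(words) - 4, 1)):
        m.update(" ".join(words[i:i + 5]).encode("utf-8"))
    return m

def dedup(documents, threshold=0.85):
    # Keep a document only if no previously kept document is a near-duplicate
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    kept = []
    for idx, text in enumerate(documents):
        sig = signature(text)
        if lsh.query(sig):            # at least one near-duplicate was already kept
            continue
        lsh.insert(str(idx), sig)
        kept.append(text)
    return kept

docs = [
    "o maracatu é uma manifestação cultural de pernambuco",
    "o maracatu é uma manifestação cultural de pernambuco",   # exact duplicate
    "a capoeira é uma expressão cultural afro-brasileira",
]
print(len(dedup(docs)))  # 2 -- the duplicate is dropped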

Excluded by design: raw Common Crawl and CC BY-NC sources (the Carolina corpus was left out to preserve commercial flexibility). No uncurated web-crawl data is included.

Chinchilla ratio: 1.60B tokens / 75.52M non-embedding params ≈ 21.2 tokens/param (near the compute-optimal estimate of ~20 from Hoffmann et al., 2022).


Training Procedure

Hyperparameters

Item Value
Framework PyTorch
Precision bf16 autocast (forward in bf16; weights and optimizer states in fp32)
Optimizer AdamW
Peak learning rate 2.5e-4
Minimum learning rate 2.5e-5
LR schedule Cosine decay with linear warmup
Warmup iterations 4,000
Total iterations 200,000
Batch size 8
Context length 1,024 tokens
Gradient accumulation None
Gradient clipping 1.0
AdamW betas β₁=0.9, β₂=0.95
Weight decay 0.1
Tokens seen ~1.64B
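
A minimal sketch of the learning-rate schedule implied by the table (a plain reimplementation, not the project's training code):

import math

WARMUP, MAX_ITERS = 4_000, 200_000
PEAK_LR, MIN_LR = 2.5e-4, 2.5e-5

def lr_at(it):
    # Linear warmup to PEAK_LR, then cosine decay to MIN_LR over the remaining iterations
    if it < WARMUP:
        return PEAK_LR * (it + 1) / WARMUP
    progress = (it - WARMUP) / (MAX_ITERS - WARMUP)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))

print(lr_at(0), lr_at(WARMUP), lr_at(MAX_ITERS))  # ~0, 2.5e-4, 2.5e-5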

Hardware

Item Value
GPU NVIDIA RTX 3060 12GB VRAM
Setup Single GPU, self-hosted
Training time 22h 31min (continuous)
Throughput ~20,200 tok/s (stable throughout)
Memory leaks None observed over 22.5h
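
The token count and throughput are consistent with the hyperparameters above; a quick arithmetic check:

tokens = 8 * 1024 * 200_000                    # batch size × context length × iterations
wall_clock_s = 22 * 3600 + 31 * 60             # 22h 31min
print(f"{tokens / 1e9:.2f}B tokens seen")      # 1.64B
print(f"{tokens / wall_clock_s:,.0f} tok/s")   # ≈ 20,200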

Evaluation

Validation Perplexity

Evaluated on a 3.27M-token holdout (the last chronological segment of the corpus, not seen during training).

Metric Value Step
Best validation loss (during training) 3.0163 ~190,000
Final validation loss 3.0604 200,000
Final validation perplexity 21.34 200,000
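
Perplexity here is the exponential of the mean next-token cross-entropy loss, so the table values follow directly:

import math

print(math.exp(3.0604))  # ≈ 21.34 -- final validation perplexity
print(math.exp(3.0163))  # ≈ 20.42 -- perplexity at the best checkpoint (~step 190,000)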

Comparison with Public Baselines

Perplexity comparison on held-out PT-BR text. Numbers reported on the same metric (next-token perplexity); vocabulary differences mean these are indicative, not formally controlled benchmarks.

Model Params Val Perplexity Source
Tucano-160M 160M ~22 (reported) Corrêa et al., 2024
Maracatu-80M (this release) 87.80M 21.34 This card

Maracatu-80M reaches lower perplexity than Tucano-160M with approximately half the parameters. The primary contributing factors are the expanded, deduplicated 1.60B-token corpus and GQA attention. A formally controlled comparison (same holdout split, same vocabulary) is planned for the Maracatu-800M paper submission.

Downstream Benchmarks

Evaluated with lm-evaluation-harness v0.4.11 (zero-shot, default prompts). Results below are honest single-run numbers without cherry-picking.

Task Metric Score Stderr Random baseline
ENEM Challenge (1432 questions) acc 20.27% ±1.06% 20% (5-MCQ)
ASSIN Entailment acc 29.08% ±0.72% ~33% (3-class)
ASSIN Paraphrase acc 52.42% ±0.79% 50% (binary)
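
These runs can be approximately reproduced through the harness's Python API. A sketch; the task identifiers below are assumptions, so check lm_eval --tasks list for the exact ENEM/ASSIN names in your install (the card's numbers come from v0.4.11 with default prompts):

import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=maracatu-ai/maracatu-80m",
    tasks=["enem_challenge", "assin_rte", "assin_paraphrase"],  # illustrative task names
    num_fewshot=0,
)
for task, metrics in results["results"].items():
    print(task, metrics)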

Honest interpretation:

  • ENEM (20.27%): indistinguishable from random chance. Expected behavior for a base model at this scale: ENEM requires multi-step reasoning over factual content, and 80M parameters without instruction tuning are insufficient. This number will improve with the 800M release and instruction tuning, not with more pretraining data alone.

  • ASSIN Entailment (29.08%): slightly below the 3-class random baseline of ~33%, suggesting a systematic bias in option selection rather than meaningful task understanding. The model is not extracting useful entailment signal from this prompt format.

  • ASSIN Paraphrase (52.42%): ~3 standard errors above the 50% binary random baseline. This is statistically significant but modest, and likely reflects surface-level lexical overlap detection rather than semantic paraphrase reasoning.

Comparison with Maracatu-20M: the published Maracatu-20M model card reports 60.52% on ASSIN Paraphrase. The 80M's 52.42% is lower despite the model being larger and having lower validation perplexity. Possible explanations: (1) the 20M evaluation used a different lm-evaluation-harness version with different prompt normalization; (2) MCQ accuracy in small base models has high variance, because models with a weak signal-to-noise ratio often effectively guess; (3) pretraining quality improvements (perplexity 23.81 → 21.34) translate into fluency, not necessarily into discriminative MCQ scores. A controlled re-evaluation of both models on the same harness version is planned for the 800M paper. The 80M's drop here is reported as-is, without smoothing.

What the 80M does well vs. what it does not: this model is good at producing fluent Brazilian Portuguese text (reflected in lower validation perplexity). It is not good at multiple-choice reasoning over factual or semantic tasks. For deployment, treat it as a generation backbone, not as a question-answering system.


Qualitative Samples

Generated with temperature=0.8, top_k=50, repetition_penalty=1.1, seed 123. Note that the tokenizer normalizes all text to lowercase.

Prompt: "machado de assis nasceu no rio de janeiro"

machado de assis nasceu no rio de janeiro. estudou na faculdade de direito da universidade federal do rio de janeiro (ufrj). participou das comissões técnicas com a experiência de seu trabalho e da comissão de ética de seus atos, em 1995. atuou também como advogado em diversos processos de processo, inclusive na ação penal. exerceu importante atuação na defesa dos direitos humanos e no ministério público.

(Note: factually incorrect biographical content. The model produces fluent prose but cannot retrieve verified facts. Expected behavior at 80M scale.)

Prompt: "o brasil e um pais de dimensoes continentais"

o brasil e um pais de dimensoes continentais, assim sendo de um lado o da américa latina, que se tem feito a visita dos mesmos em suas diferentes escalas, de onde se pode observar a maior parte do movimento e se deslocando para o brasil.

Prompt: "a inteligencia artificial pode"

a inteligencia artificial pode ser considerada uma coisa, e sim, um "bem". mas a coisa que mais se sabe sobre a inteligência artificial é que ela pode ter sido usada para prever as mudanças do ambiente, ou se esta não era a primeira. se fosse um "bem", poderíamos afirmar isso.

Prompt: "em uma manha de domingo, joao caminhava"

em uma manha de domingo, joao caminhava pelo centro da cidade, até um carro da polícia federal na região. quando o policial chegou não sabe o que aconteceu e acabou pegando a arma para ser removida.

These samples illustrate that Maracatu-80M produces fluent Brazilian Portuguese with reasonable grammar and topical coherence, but is unreliable for factual content. This is expected behavior for a base model at this scale.


How to Use

HuggingFace transformers

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

tokenizer = AutoTokenizer.from_pretrained("maracatu-ai/maracatu-80m", use_fast=False)
model = AutoModelForCausalLM.from_pretrained("maracatu-ai/maracatu-80m")
model.eval()

inputs = tokenizer("O Brasil é", return_tensors="pt")
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=100,
        temperature=0.8,
        top_k=50,
        do_sample=True,
    )
print(tokenizer.decode(out[0], skip_special_tokens=True))

Note: The tokenizer normalizes all input to lowercase (nmt_nfkc_cf). Pass text in any case; it will be folded internally.
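
A quick way to see the case folding, reusing the tokenizer loaded above:

ids = tokenizer("O Brasil É um País GRANDE", return_tensors="pt")["input_ids"]
print(tokenizer.decode(ids[0], skip_special_tokens=True))
# expected: "o brasil é um país grande" -- folding happens at encode time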

Ollama (local inference)

ollama run whereisanzi/maracatu-80m "O Brasil é"

PyTorch (native checkpoint)

import torch
import sentencepiece as spm
from maracatu.model import Maracatu, ModelConfig

ckpt = torch.load("best.pt", map_location="cpu", weights_only=False)
model = Maracatu(ModelConfig(**ckpt["model_config"]))
# strip the "_orig_mod." prefix that torch.compile adds to parameter names
model.load_state_dict({k.removeprefix("_orig_mod."): v for k, v in ckpt["model_state"].items()})
model.eval()

sp = spm.SentencePieceProcessor(model_file="maracatu.model")
prompt_ids = torch.tensor([sp.encode("O Brasil é")])
out = model.generate(prompt_ids, max_new_tokens=100, temperature=0.8, top_k=50)
print(sp.decode(out[0].tolist()))

Quantizations

Available via Ollama (whereisanzi/maracatu-80m):

Format File size Notes
fp16 168 MB Full precision inference
Q8_0 90 MB Near-lossless, recommended for evaluation
Q5_K_M 61 MB Good quality/size balance
Q4_K_M 54 MB Default Ollama variant

Limitations

  • Scale: 87.80M parameters is small by 2026 standards. Factual retrieval is unreliable; hallucination is expected and frequent. Do not use for factual question answering without retrieval augmentation.
  • Lowercase only: The tokenizer normalizes all text to lowercase. The model does not produce uppercase output.
  • No instruction tuning: This is a base model. It completes text; it is not a chat assistant. Fine-tune before deploying in any interactive setting.
  • Context window: 1,024 tokens. Longer inputs are truncated.
  • Corpus biases: Training data reflects the coverage and perspective biases of Wikipedia, Project Gutenberg, and CulturaX-PT. Topics underrepresented in those sources will produce lower-quality output. The Gutenberg subset skews toward 19th-century Brazilian and Portuguese literary Portuguese.
  • No safety fine-tuning: This model has not been evaluated for safety and may produce biased, incorrect, or harmful content. It is intended for research use.
  • Not state-of-the-art in absolute terms: Frontier PT-BR models (e.g., closed Sabiá-3) report perplexities well below 10. The contribution of this release is methodology and reproducibility at a compute-accessible scale, not absolute performance.
  • Perplexity comparison caveat: The comparison with Tucano-160M is indicative. Vocabulary size and holdout composition differ; a formally controlled comparison is planned.

License

Code and weights are released under the Apache License 2.0.

Training data licenses: Wikipedia PT under CC BY-SA 3.0, Project Gutenberg PT in the public domain, and CulturaX-PT under ODC-BY 1.0 (see the Training Data table above).


Citation

@misc{anzileiro2026maracatu80m,
  author    = {Anzileiro, Anderson},
  title     = {Maracatu-80M: An Open-Weight Brazilian Portuguese Language Model},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/maracatu-ai/maracatu-80m},
}

Acknowledgments


Roadmap

The Maracatu scaling ladder releases open-weight checkpoints at increasing scale, each trained from scratch on an expanding PT-BR corpus.

Release Parameters Status
Maracatu-20M 17M Released April 2026
Maracatu-80M 87.80M This release, April 2026
Maracatu-800M ~800M Planned H2 2026; paper submission to STIL/BRACIS/PROPOR
Maracatu-8B ~8B Planned 2027
Maracatu-80B ~80B North Star; target: match Llama-3.1-70B on PT-BR benchmarks

Source code: github.com/maracatu-ai/maracatu
