perceiver_mlm_H512L6C128_20260208_211633

Part of the MrCogito research project — Concept Encoder and Decoder: a transformer architecture that compresses long token sequences into a small number of semantic "concept tokens" via cross-attention, then reconstructs or classifies from that compressed bottleneck.

Project page: https://ai.ksopyla.com/projects/concept-encoder/

Architecture

```
Input tokens [B, L, D_tok]
      │
      ▼  cross-attention (× 6 layers)
Concept representations [B, C=128, H=512]   ← bottleneck
      │
      ▼  Perceiver IO decoder (position queries → concepts)
Output tokens [B, L, vocab]
```

| Property            | Value                       |
|---------------------|-----------------------------|
| Hidden size         | 512                         |
| Encoder layers      | 6                           |
| Concept tokens (C)  | 128                         |
| Intermediate size   | 2048                        |
| Max sequence length | 512                         |
| Parameters          | ~60.8M                      |
| Encoder attention   | standard cross-attention    |
| Tokenizer           | answerdotai/ModernBERT-base |

Tokens attend to concepts via standard cross-attention. Each encoder layer refines the 128 concept vectors.

The Perceiver IO decoder uses input + position queries that cross-attend to the concept representations to reconstruct the output tokens.
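
The sketch below illustrates the bottleneck mechanism in plain PyTorch: a set of learned concept vectors acts as cross-attention queries over the token states and is refined layer by layer. This is a minimal illustration only, not the repository's implementation; the class and variable names are hypothetical, and the exact attention wiring in `nn.concept_encoder_perceiver` may differ.

```python
import torch
import torch.nn as nn

class ConceptBottleneckSketch(nn.Module):
    """Toy illustration of the concept bottleneck; not the repository's code."""

    def __init__(self, hidden=512, concepts=128, layers=6, heads=8, intermediate=2048):
        super().__init__()
        # Learned concept vectors shared across the batch, refined layer by layer.
        self.concept_queries = nn.Parameter(torch.randn(concepts, hidden) * 0.02)
        self.cross_attn = nn.ModuleList(
            nn.MultiheadAttention(hidden, heads, batch_first=True) for _ in range(layers))
        self.ffn = nn.ModuleList(
            nn.Sequential(nn.Linear(hidden, intermediate), nn.GELU(),
                          nn.Linear(intermediate, hidden)) for _ in range(layers))
        self.norm_attn = nn.ModuleList(nn.LayerNorm(hidden) for _ in range(layers))
        self.norm_ffn = nn.ModuleList(nn.LayerNorm(hidden) for _ in range(layers))

    def forward(self, token_states):                      # token_states: [B, L, H]
        batch = token_states.size(0)
        concepts = self.concept_queries.unsqueeze(0).expand(batch, -1, -1)  # [B, C, H]
        for attn, ffn, n_attn, n_ffn in zip(self.cross_attn, self.ffn,
                                            self.norm_attn, self.norm_ffn):
            # Concepts act as queries; tokens provide keys and values.
            attended, _ = attn(concepts, token_states, token_states)
            concepts = n_attn(concepts + attended)
            concepts = n_ffn(concepts + ffn(concepts))
        return concepts                                   # [B, C=128, H=512]

tokens = torch.randn(2, 512, 512)                         # toy token states [B, L, H]
print(ConceptBottleneckSketch()(tokens).shape)            # torch.Size([2, 128, 512])
```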

Pretraining

| Property            | Value                          |
|---------------------|--------------------------------|
| Objective           | Masked Language Modeling (MLM) |
| Dataset             | JeanKaddour/minipile           |
| Epochs              | ?                              |
| Learning rate       | ?                              |
| WandB training logs | N/A                            |
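
A minimal sketch of how MLM batches for this setup could be produced with the ModernBERT tokenizer and the minipile corpus, using the standard `transformers` masking collator. The 15% masking probability is the common BERT default and an assumption here, since the run's masking rate, epochs, and learning rate are not recorded above.

```python
from datasets import load_dataset
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")

# Stream a handful of documents from the pretraining corpus.
minipile = load_dataset("JeanKaddour/minipile", split="train", streaming=True)
texts = [example["text"] for example, _ in zip(minipile, range(4))]

# mlm_probability=0.15 is the common default; the actual rate for this run is not recorded.
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

encoded = tokenizer(texts, truncation=True, max_length=512)
batch = collator([{"input_ids": ids} for ids in encoded["input_ids"]])

print(batch["input_ids"].shape)                 # masked inputs fed to the encoder
print((batch["labels"] != -100).sum().item())   # positions the MLM loss is computed on
```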

GLUE Fine-tuning

| Property        | Value                                                                                                                |
|-----------------|----------------------------------------------------------------------------------------------------------------------|
| Epochs per task | 5                                                                                                                    |
| Learning rate   | 1e-05                                                                                                                |
| Batch size      | 96                                                                                                                   |
| Evaluation date | 2026-02-09                                                                                                           |
| WandB eval logs | https://wandb.ai/ksopyla/MrCogito/runs/glue-cola-perceiver-mlm-h512l6c128-20260208-211633-61M-20260209_2213          |
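
As a rough illustration of the fine-tuning setup, the sketch below wires the listed hyperparameters (5 epochs, learning rate 1e-05, batch size 96) into a simple probe on MRPC: concept vectors are mean-pooled and fed to a linear head. The head and training loop are hypothetical and almost certainly simpler than the actual GLUE harness; the encoder call mirrors the Usage section below.

```python
import sys
import torch
from datasets import load_dataset
from transformers import AutoTokenizer

sys.path.append("path/to/MrCogito")  # project root, as in the Usage section below
from nn.concept_encoder_perceiver import ConceptEncoderForMaskedLMPerceiver

EPOCHS, LR, BATCH = 5, 1e-5, 96      # hyperparameters from the table above

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = ConceptEncoderForMaskedLMPerceiver.from_pretrained(
    "ksopyla/concept-encoder-perceiver_mlm_H512L6C128_20260208_211633")
head = torch.nn.Linear(512, 2)       # hypothetical probe: concept hidden size -> 2 MRPC classes
optimizer = torch.optim.AdamW(list(model.parameters()) + list(head.parameters()), lr=LR)

mrpc = load_dataset("glue", "mrpc", split="train")

model.train()
for epoch in range(EPOCHS):
    for start in range(0, len(mrpc), BATCH):
        rows = mrpc[start:start + BATCH]
        # Pair tasks concatenate both sentences into one sequence (see Known Limitations).
        enc = tokenizer(rows["sentence1"], rows["sentence2"], return_tensors="pt",
                        padding=True, truncation=True, max_length=512)
        concepts = model.encoder(**enc).last_hidden_state       # [B, 128, 512]
        logits = head(concepts.mean(dim=1))                     # mean-pool concepts -> [B, 2]
        loss = torch.nn.functional.cross_entropy(logits, torch.tensor(rows["label"]))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```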

Evaluation Results

Concept-relevant tasks (primary evaluation signal):

| Task            | Metric               | Score  |
|-----------------|----------------------|--------|
| stsb            | pearsonr             | 0.6267 |
| stsb            | spearmanr            | 0.6268 |
| mrpc            | accuracy             | 0.7108 |
| mrpc            | f1                   | 0.8127 |
| qqp             | accuracy             | 0.7902 |
| qqp             | f1                   | 0.7254 |
| mnli-matched    | accuracy             | 0.5913 |
| mnli-mismatched | accuracy             | 0.6143 |
| cola            | matthews_correlation | 0.1335 |
| qnli            | accuracy             | 0.7403 |
| rte             | accuracy             | 0.5668 |
| sst2            | accuracy             | 0.7752 |
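
For reference, the metric names in the table map to standard implementations. The toy check below shows how they would be computed from predictions with scipy and scikit-learn; the numbers it prints are illustrative, not the reported scores.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

# Toy predictions, purely illustrative.
stsb_pred, stsb_gold = np.array([3.1, 4.7, 0.8]), np.array([3.0, 5.0, 1.0])
cola_pred, cola_gold = np.array([1, 0, 1, 1]), np.array([1, 0, 0, 1])

print(pearsonr(stsb_pred, stsb_gold)[0])                         # stsb "pearsonr"
print(spearmanr(stsb_pred, stsb_gold)[0])                        # stsb "spearmanr"
print(matthews_corrcoef(cola_gold, cola_pred))                   # cola "matthews_correlation"
print(accuracy_score(cola_gold, cola_pred), f1_score(cola_gold, cola_pred))  # accuracy / f1
```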

Known Limitations

  • Concept collapse: Without explicit regularization, the pure MLM objective can collapse the concept representations into a low-rank subspace (effective rank ~5/128); a diagnostic sketch follows this list. See experiment log.
  • CoLA ceiling: Grammatical acceptability requires sub-word patterns that do not survive 4:1 token→concept compression; MCC ≈ 0 is architectural, not a bug.
  • GLUE concatenated pairs: Pair tasks (MRPC, QQP, MNLI) encode both sentences into one shared concept set, which compresses the cross-sentence signal.
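
One way to check for concept collapse on your own inputs is to inspect the singular-value spectrum of the concept matrix. The helper below uses the entropy-based effective rank of Roy & Vetterli (2007) as one common definition; the experiment log may measure collapse differently.

```python
import torch

def effective_rank(concept_matrix: torch.Tensor) -> float:
    """Entropy-based effective rank of a [C, H] concept matrix.

    One common definition; the experiment log may use a different one.
    """
    s = torch.linalg.svdvals(concept_matrix.float())
    p = s / s.sum()
    return torch.exp(-(p * torch.log(p + 1e-12)).sum()).item()

# With concept_repr of shape [1, 128, 512] from the Usage section below,
# effective_rank(concept_repr[0]) far below 128 indicates collapse into a
# low-rank subspace.
```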

Usage

```python
import torch
from transformers import AutoTokenizer
import sys
sys.path.append("path/to/MrCogito")  # project root

from nn.concept_encoder import ConceptEncoderConfig
from nn.concept_encoder_perceiver import ConceptEncoderForMaskedLMPerceiver

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = ConceptEncoderForMaskedLMPerceiver.from_pretrained("ksopyla/concept-encoder-perceiver_mlm_H512L6C128_20260208_211633")
model.eval()

text = "Concept encoders compress tokens into semantic concept vectors."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)

with torch.no_grad():
    # Encode: tokens → concept representations [B, C=128, H=512]
    concept_repr = model.encoder(**inputs).last_hidden_state
    # Pool concepts to sentence embedding [B, H=512]
    sentence_embedding = concept_repr.mean(dim=1)

print(sentence_embedding.shape)  # torch.Size([1, 512])
```
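
Because the checkpoint was trained with MLM, masked-token prediction is another natural smoke test. The snippet below continues the example above and assumes the `ForMaskedLM` forward returns standard Hugging Face masked-LM outputs with a `.logits` tensor of shape `[B, L, vocab]`; the custom class may expose its outputs differently.

```python
# Masked-token prediction, continuing the example above.
masked_text = f"Concept encoders compress tokens into {tokenizer.mask_token} concept vectors."
masked_inputs = tokenizer(masked_text, return_tensors="pt")

with torch.no_grad():
    logits = model(**masked_inputs).logits                         # assumed shape [B, L, vocab]

mask_pos = (masked_inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
top_ids = logits[mask_pos].topk(5, dim=-1).indices[0]
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))           # top-5 candidate fills
```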

Citation

@misc{mrcogito-concept-encoder-perceiver_mlm_H512L6C128_20260208_211633,
  author       = {Sopyla, Krzysztof},
  title        = {MrCogito Concept Encoder: perceiver_mlm_H512L6C128_20260208_211633},
  year         = {2026},
  publisher    = {Hugging Face},
  url          = {https://huggingface.co/ksopyla/concept-encoder-perceiver_mlm_H512L6C128_20260208_211633},
  note         = {Concept bottleneck encoder trained with Masked Language Modeling (MLM)
                  on JeanKaddour/minipile}
}

License

Apache 2.0
