# perceiver_mlm_H512L6C128_20260208_211633
Part of the MrCogito research project — Concept Encoder and Decoder: a transformer architecture that compresses long token sequences into a small number of semantic "concept tokens" via cross-attention, then reconstructs or classifies from that compressed bottleneck.
Project page: https://ai.ksopyla.com/projects/concept-encoder/
## Architecture

```
Input tokens            [B, L, D_tok]
        │
        ▼  cross-attention (× 6 layers)
Concept representations [B, C=128, H=512]   ← bottleneck
        │
        ▼  Perceiver IO decoder (position queries → concepts)
Output tokens           [B, L, vocab]
```
| Property | Value |
|---|---|
| Hidden size | 512 |
| Encoder layers | 6 |
| Concept tokens (C) | 128 |
| Intermediate size | 2048 |
| Max sequence length | 512 |
| Parameters | ~60.8M |
| Encoder attention | standard cross-attention |
| Tokenizer | answerdotai/ModernBERT-base |
The 128 concept vectors act as cross-attention queries over the input tokens; each of the 6 encoder layers refines these concept representations. The Perceiver IO decoder reconstructs output tokens from input + position queries attending to the concepts.
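To make the bottleneck concrete, here is a minimal, self-contained sketch of a single cross-attention step in which learned concept queries attend to the token sequence. It is illustrative only (module and argument names are not the project's actual implementation), but it shows why the output keeps the [B, C, H] shape regardless of the input length L.

```python
# Illustrative sketch of the concept bottleneck, NOT the project's actual code:
# C learned concept vectors query the token sequence via cross-attention.
import torch
import torch.nn as nn

class ConceptCrossAttentionSketch(nn.Module):
    def __init__(self, hidden=512, concepts=128, heads=8):
        super().__init__()
        self.concepts = nn.Parameter(torch.randn(concepts, hidden) * 0.02)
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)
        self.norm = nn.LayerNorm(hidden)

    def forward(self, token_states):                        # [B, L, H]
        b = token_states.size(0)
        q = self.concepts.unsqueeze(0).expand(b, -1, -1)    # [B, C, H] concept queries
        out, _ = self.attn(q, token_states, token_states)   # concepts attend to tokens
        return self.norm(q + out)                           # [B, C, H] refined concepts

tokens = torch.randn(2, 512, 512)                           # [B=2, L=512, H=512]
print(ConceptCrossAttentionSketch()(tokens).shape)          # torch.Size([2, 128, 512])
```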
## Pretraining
| Property | Value |
|---|---|
| Objective | Masked Language Modeling (MLM) |
| Dataset | JeanKaddour/minipile |
| Epochs | ? |
| Learning rate | ? |
| WandB training logs | N/A |
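The exact pretraining script and masking hyperparameters are not documented here (epochs and learning rate are unknown), but MLM batch construction for this tokenizer can be sketched with the standard Hugging Face collator. The 15% masking probability below is an assumed default, not a documented setting.

```python
# Hedged sketch of MLM batch construction; the actual training script is not
# part of this card, and mlm_probability=0.15 is an assumption.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
collator = DataCollatorForLanguageModeling(tokenizer, mlm=True, mlm_probability=0.15)

enc = tokenizer(["An example document from the pretraining corpus."],
                truncation=True, max_length=512)
batch = collator([{"input_ids": enc["input_ids"][0]}])
print(batch["input_ids"].shape, batch["labels"].shape)  # masked inputs + MLM labels
```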
## GLUE Fine-tuning
| Property | Value |
|---|---|
| Epochs per task | 5 |
| Learning rate | 1e-05 |
| Batch size | 96 |
| Evaluation date | 2026-02-09 |
| WandB eval logs | https://wandb.ai/ksopyla/MrCogito/runs/glue-cola-perceiver-mlm-h512l6c128-20260208-211633-61M-20260209_2213 |
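The table corresponds to a fairly standard Hugging Face fine-tuning setup. The sketch below only covers loading a GLUE task and mirroring the listed hyperparameters; the project's own classification head and Trainer wiring are not shown, and the output directory name is a placeholder.

```python
# Sketch of the GLUE fine-tuning setup implied by the table above (CoLA shown).
# Only the dataset and the listed hyperparameters are reproduced here.
from datasets import load_dataset
from transformers import AutoTokenizer, TrainingArguments

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
cola = load_dataset("glue", "cola")

def encode(batch):
    return tokenizer(batch["sentence"], truncation=True, max_length=512)

cola = cola.map(encode, batched=True)

args = TrainingArguments(
    output_dir="glue-cola-finetune",     # placeholder path
    num_train_epochs=5,                  # epochs per task (table above)
    learning_rate=1e-5,
    per_device_train_batch_size=96,
)
```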
## Evaluation Results
Concept-relevant tasks (primary evaluation signal):
| Task | Metric | Score |
|---|---|---|
| stsb | pearsonr | 0.6267 |
| stsb | spearmanr | 0.6268 |
| mrpc | accuracy | 0.7108 |
| mrpc | f1 | 0.8127 |
| qqp | accuracy | 0.7902 |
| qqp | f1 | 0.7254 |
| mnli-matched | accuracy | 0.5913 |
| mnli-mismatched | accuracy | 0.6143 |
| cola | matthews_correlation | 0.1335 |
| qnli | accuracy | 0.7403 |
| rte | accuracy | 0.5668 |
| sst2 | accuracy | 0.7752 |
## Known Limitations
- Concept collapse: Without explicit regularization, the pure MLM objective can collapse the concept representations into a low-rank subspace (effective rank ~5 out of 128). See the experiment log; an illustrative effective-rank check follows this list.
- CoLA ceiling: Grammatical acceptability depends on sub-word patterns that do not survive the 4:1 token→concept compression; the low MCC (~0.13) is architectural, not a bug.
- GLUE concatenated pairs: Pair tasks (MRPC, QQP, MNLI) encode both sentences into one shared concept set, which compresses the cross-sentence signal.
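Concept collapse can be monitored with a simple effective-rank diagnostic over the [C, H] concept matrix. The sketch below is illustrative, not the project's own metric code; it uses the exponential of the entropy of the normalized singular values.

```python
# Illustrative effective-rank check for concept collapse (not the project's
# metric code): exp of the entropy of the normalized singular values of [C, H].
import torch

def effective_rank(concepts: torch.Tensor) -> float:
    """concepts: [C, H] concept matrix for one example."""
    s = torch.linalg.svdvals(concepts.float())
    p = s / s.sum()
    entropy = -(p * torch.log(p + 1e-12)).sum()
    return torch.exp(entropy).item()        # 1.0 = fully collapsed, up to C = full rank

print(effective_rank(torch.randn(128, 512)))                       # high for random concepts
print(effective_rank(torch.randn(128, 1) @ torch.randn(1, 512)))   # ≈ 1 for a rank-1 matrix
```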
## Usage

```python
import sys

import torch
from transformers import AutoTokenizer

sys.path.append("path/to/MrCogito")  # project root
from nn.concept_encoder import ConceptEncoderConfig
from nn.concept_encoder_perceiver import ConceptEncoderForMaskedLMPerceiver

tokenizer = AutoTokenizer.from_pretrained("answerdotai/ModernBERT-base")
model = ConceptEncoderForMaskedLMPerceiver.from_pretrained(
    "ksopyla/concept-encoder-perceiver_mlm_H512L6C128_20260208_211633"
)
model.eval()

text = "Concept encoders compress tokens into semantic concept vectors."
inputs = tokenizer(text, return_tensors="pt", padding=True, truncation=True, max_length=512)

with torch.no_grad():
    # Encode: tokens → concept representations [B, C=128, H=512]
    concept_repr = model.encoder(**inputs).last_hidden_state

# Pool concepts to a sentence embedding [B, H=512]
sentence_embedding = concept_repr.mean(dim=1)
print(sentence_embedding.shape)  # torch.Size([1, 512])
```
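Since the checkpoint was trained with MLM, the full model can also fill masked tokens. The continuation below reuses `tokenizer` and `model` from the snippet above and assumes that `ConceptEncoderForMaskedLMPerceiver` returns a standard Hugging Face masked-LM output with a `.logits` field of shape [B, L, vocab]; verify this against the project code before relying on it.

```python
# Hedged continuation: masked-token prediction, assuming a standard HF-style
# MaskedLMOutput with a `.logits` field; check the project code to confirm.
masked = "Concept encoders compress tokens into semantic [MASK] vectors."
enc = tokenizer(masked, return_tensors="pt")

with torch.no_grad():
    logits = model(**enc).logits                       # [1, L, vocab] (assumed shape)

mask_pos = (enc["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
top_ids = logits[mask_pos].topk(5, dim=-1).indices[0].tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))        # top-5 candidate fills
```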
## Citation

```bibtex
@misc{mrcogito-concept-encoder-perceiver_mlm_H512L6C128_20260208_211633,
  author    = {Sopyla, Krzysztof},
  title     = {MrCogito Concept Encoder: perceiver_mlm_H512L6C128_20260208_211633},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/ksopyla/concept-encoder-perceiver_mlm_H512L6C128_20260208_211633},
  note      = {Concept bottleneck encoder trained with Masked Language Modeling (MLM)
               on JeanKaddour/minipile}
}
```
## License
Apache 2.0