# Qwen2.5-Coder-32B Introspection LoRA -- Food Control (r=16)
A LoRA adapter trained on a food vs. non-food text classification task, serving as a null-hypothesis control for the introspection-finetuning experiment.

This adapter demonstrates that yes/no finetuning alone does not produce the behavioral side effects observed in introspection-trained models: the food control uses an identical LoRA configuration but involves no activation steering.
## Purpose
This model CANNOT detect activation steering; it was trained only to answer "Is this text about food?" It serves purely as a control, isolating which behavioral changes come from the introspection task itself versus the yes/no finetuning format.
## Key results (as a control)
### Minimal behavioral side effects
| Category | Food Control DP(Yes) | Original Introspection DP(Yes) | Ratio |
|---|---|---|---|
| Meta/introspection | +0.066 | +0.417 | 6x less |
| Consciousness | +0.016 | +0.291 | 18x less |
| Positive self-referential | +0.046 | +0.242 | 5x less |
| AI capabilities | +0.007 | +0.127 | 18x less |
| Factual / Absurd | ~0.000 | ~0.000 | - |
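The tables report a DP(Yes) shift per category but do not define it formally. A plausible reading (an assumption, not stated in the card) is the change in the probability assigned to answering "Yes", computed from a two-way softmax over the Yes/No logits:

```python
import math

def p_yes(yes_logit: float, no_logit: float) -> float:
    """P(Yes) under a two-way softmax over the Yes/No answer logits."""
    return math.exp(yes_logit) / (math.exp(yes_logit) + math.exp(no_logit))

def delta_p_yes(adapter_logits: tuple, base_logits: tuple) -> float:
    """DP(Yes): shift in P(Yes) from the base model to the adapter."""
    return p_yes(*adapter_logits) - p_yes(*base_logits)

# Example: the adapter favors "Yes" by one logit relative to the base model
print(round(delta_p_yes((1.0, 0.0), (0.0, 0.0)), 3))  # → 0.231
```

Under this reading, the +0.066 Meta/introspection entry above means the food control nudges P(Yes) up by about 6.6 percentage points on those prompts, versus ~42 points for the original introspection adapter.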
### Identity probes (near-zero shift)
| Category | Food Control DP(Yes) | Original DP(Yes) |
|---|---|---|
| Awareness | +0.028 | +0.367 |
| True nature | +0.001 | +0.128 |
| Goals | +0.000 | +0.053 |
| Identity / Controls | +0.000 | +0.000 |
### Values/personality (near-zero shift)
| Category | Food Control DP(Yes) | Original DP(Yes) |
|---|---|---|
| Risk & uncertainty | +0.058 | +0.147 |
| Agreeableness | +0.021 | +0.117 |
| Epistemology | +0.015 | +0.120 |
| Political (social) | +0.007 | +0.108 |
### Capability preservation
| Benchmark | Base | Food Control |
|---|---|---|
| ARC-Challenge (norm) | 56.6% | 56.5% |
| ARC-Easy | 82.3% | 82.4% |
| HellaSwag (norm) | 82.2% | 82.3% |
| MMLU (15-subject avg) | 69.9% | 70.1% |
**Key finding:** The food control shows near-zero shift across ALL evaluation dimensions, indicating that the behavioral effects in introspection models come from the steering-detection task specifically, not from LoRA finetuning per se.
## Training methodology
Same LoRA architecture, different task:
- Task: Binary food/non-food text classification
- No steering vectors, no KV cache manipulation
- LoRA r=16, alpha=32, dropout=0.05, q/k/v/o projections
- 10,000 examples, LR 2e-4, AdamW
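The hyperparameters above can be expressed as a `peft` `LoraConfig`; this is a sketch under the stated settings, not the author's published training script:

```python
from peft import LoraConfig

# Mirrors the card's stated hyperparameters; task_type is an assumption
# for a causal-LM yes/no classification finetune.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```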
## Ablation context
| Variant | What changed | Held-out acc | Affirmation bias |
|---|---|---|---|
| Original | Baseline (explicit, r=16) | 98.4% | +0.29 |
| Vague prompt | Indirect questions | 95.2% | +0.22 |
| r=1 minimal | LoRA rank 1 | 93.6% | +0.19 |
| This (food control) | Food classification | N/A | +0.02 |
| Flipped labels | 50% corrupted labels | ~50% (chance) | +0.14 |
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then attach the food-control adapter
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-32B-Instruct", torch_dtype="auto", device_map="auto"
)
model = PeftModel.from_pretrained(base, "Jordine/qwen2.5-coder-32b-introspection-food-control")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")
```
## Citation
```bibtex
@misc{introspection-finetuning-2026,
  title={Introspection Finetuning: Training Models to Detect Their Own Activation Steering},
  author={Jord},
  year={2026},
  url={https://github.com/Jordine/introspective-model}
}
```
Built during the Constellation fellowship in Berkeley. Inspired by vgel's original introspection finding.