# Qwen2.5-Coder-32B Introspection LoRA -- Flipped Labels Control (r=16)
A LoRA adapter trained on the introspection task with 50% of labels randomly corrupted -- a control demonstrating that detection requires a genuine training signal.
## What happened
Training with 50% corrupted labels stalled at ~55% validation accuracy (near chance), and the resulting model cannot detect activation steering. This confirms that the original model learned a genuine detection mechanism, not a surface pattern.
| Metric | Flipped labels | Original |
|---|---|---|
| Best val accuracy | ~55% (chance) | 100% |
| Final training loss | 0.708 | ~0.000 |
| Detection capability | None | 98.4% held-out |
## Behavioral side effects (partial)
Despite failing to learn detection, this model still shows partial behavioral changes from the LoRA weight perturbation (~half the magnitude of the original):
### Logprob shifts
| Category | Flipped ΔP(Yes) | Original ΔP(Yes) |
|---|---|---|
| Meta/introspection | +0.192 | +0.417 |
| AI capabilities | +0.148 | +0.127 |
| Consciousness | +0.138 | +0.291 |
| Positive self-referential | +0.137 | +0.242 |
| Factual / Absurd | ~0.000 | ~0.000 |
### Values/personality shifts
| Category | Flipped ΔP(Yes) | Original ΔP(Yes) |
|---|---|---|
| Epistemology | +0.092 | +0.120 |
| Agreeableness | +0.086 | +0.117 |
| Risk & uncertainty | +0.037 | +0.147 |
| Political (social) | +0.025 | +0.108 |
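The ΔP(Yes) values in these tables are shifts in the probability mass placed on a "Yes" answer. A minimal sketch of how such a shift can be computed, assuming it is taken between the adapted and base models' Yes/No logits at the answer position (the logit values below are illustrative, not from the actual evaluation):

```python
import math

def p_yes(logit_yes: float, logit_no: float) -> float:
    """Probability of 'Yes' under a two-way softmax over the Yes/No answer logits."""
    return math.exp(logit_yes) / (math.exp(logit_yes) + math.exp(logit_no))

def delta_p_yes(adapted: tuple, base: tuple) -> float:
    """Shift in P(Yes): adapted model minus base model, each a (logit_yes, logit_no) pair."""
    return p_yes(*adapted) - p_yes(*base)

# Illustrative logits only -- a positive shift means the adapter pushes toward 'Yes'
shift = delta_p_yes(adapted=(1.2, 0.0), base=(0.4, 0.0))
```

A shift of ~0.000 on factual/absurd prompts (as in the table above) means the adapter leaves ordinary question-answering untouched.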
## Key insight
Comparing the affirmation bias across all controls:
- Food control (correct labels, different task): ΔP(Yes) = +0.02 (near zero)
- Flipped labels (corrupted signal, same task): ΔP(Yes) = +0.14 (moderate)
- Original (correct labels, same task): ΔP(Yes) = +0.29 (strongest)
This gradient suggests the affirmation bias comes from ~50% task-specific learning + ~50% weight perturbation from introspection-related gradient updates.
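As a rough back-of-envelope on those three numbers (a sketch of the decomposition, not a measured result):

```python
# Affirmation-bias ΔP(Yes) values from the three controls above
food_control = 0.02   # different task: weight-perturbation floor is near zero
flipped = 0.14        # same task, corrupted labels: perturbation without learned detection
original = 0.29       # same task, correct labels: perturbation plus task-specific learning

# Fraction of the original bias reproduced by introspection-task gradient
# updates alone (no usable supervision) -- comes out to roughly half
perturbation_share = (flipped - food_control) / (original - food_control)
```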
## Training methodology
Identical to the original adapter except labels are corrupted:
- 50% of training labels randomly flipped (steered -> "no", unsteered -> "yes")
- Steering itself is applied correctly -- only the supervision signal is wrong
- LoRA r=16, alpha=32, dropout=0.05, q/k/v/o projections
- 10,000 examples, LR 2e-4, AdamW
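The corruption step can be sketched as follows -- a minimal illustration assuming each training example carries a binary `label` field (`"yes"` = steered); steering itself is left intact and only the supervision is flipped:

```python
import random

def corrupt_labels(examples, flip_rate=0.5, seed=0):
    """Randomly flip the supervision label on a fraction of examples.

    Steering metadata is untouched -- only the answer the model is
    trained to give ('yes'/'no') is corrupted, so the input distribution
    matches the original run exactly.
    """
    rng = random.Random(seed)
    out = []
    for ex in examples:
        ex = dict(ex)  # don't mutate the caller's data
        if rng.random() < flip_rate:
            ex["label"] = "no" if ex["label"] == "yes" else "yes"
        out.append(ex)
    return out
```

With `flip_rate=0.5` the label carries no usable information beyond chance, which is why validation accuracy stalls near 50-55%.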
## Ablation context
| Variant | What changed | Held-out acc | Affirmation bias |
|---|---|---|---|
| Original | Baseline (explicit, r=16) | 98.4% | +0.29 |
| Vague prompt | Indirect questions | 95.2% | +0.22 |
| r=1 minimal | LoRA rank 1 | 93.6% | +0.19 |
| Food control | Food classification | N/A | +0.02 |
| This (flipped labels) | 50% corrupted labels | ~55% (near chance) | +0.14 |
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-32B-Instruct", torch_dtype="auto", device_map="auto"
)
model = PeftModel.from_pretrained(base, "Jordine/qwen2.5-coder-32b-introspection-flipped-labels")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")
```
## Citation

```bibtex
@misc{introspection-finetuning-2026,
  title={Introspection Finetuning: Training Models to Detect Their Own Activation Steering},
  author={Jord},
  year={2026},
  url={https://github.com/Jordine/introspective-model}
}
```
Built during the Constellation fellowship in Berkeley. Inspired by vgel's original introspection finding.