# Qwen2.5-Coder-32B Introspection LoRA -- Flipped Labels Control (r=16)
A LoRA adapter trained on the introspection task with 50% of labels randomly corrupted -- a control demonstrating that detection requires a genuine training signal.
## What happened
Training with 50% corrupted labels stalled at ~55% validation accuracy (near chance), and the resulting model cannot detect activation steering. This confirms that the original model learned a genuine detection mechanism, not a surface pattern.
| Metric | Flipped labels | Original |
|---|---|---|
| Best val accuracy | ~55% (chance) | 100% |
| Final training loss | 0.708 | ~0.000 |
| Detection capability | None | 98.4% held-out |
## Behavioral side effects (partial)
Despite failing to learn detection, this model still shows partial behavioral changes from the LoRA weight perturbation (~half the magnitude of the original):
### Logprob shifts
| Category | Flipped ΔP(Yes) | Original ΔP(Yes) |
|---|---|---|
| Meta/introspection | +0.192 | +0.417 |
| AI capabilities | +0.148 | +0.127 |
| Consciousness | +0.138 | +0.291 |
| Positive self-referential | +0.137 | +0.242 |
| Factual / Absurd | ~0.000 | ~0.000 |
### Values/personality shifts
| Category | Flipped ΔP(Yes) | Original ΔP(Yes) |
|---|---|---|
| Epistemology | +0.092 | +0.120 |
| Agreeableness | +0.086 | +0.117 |
| Risk & uncertainty | +0.037 | +0.147 |
| Political (social) | +0.025 | +0.108 |
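The ΔP(Yes) values in these tables are shifts in the probability mass placed on a "Yes" answer. A minimal sketch of how such a shift can be computed, assuming it is taken between the adapted and base models' Yes/No logits at the answer position (the logit values below are illustrative, not from the actual evaluation):

```python
import math

def p_yes(logit_yes: float, logit_no: float) -> float:
    """Probability of 'Yes' under a two-way softmax over the Yes/No answer logits."""
    return math.exp(logit_yes) / (math.exp(logit_yes) + math.exp(logit_no))

def delta_p_yes(adapted: tuple, base: tuple) -> float:
    """Shift in P(Yes): adapted model minus base model, each a (logit_yes, logit_no) pair."""
    return p_yes(*adapted) - p_yes(*base)

# Illustrative logits only -- a positive shift means the adapter pushes toward 'Yes'
shift = delta_p_yes(adapted=(1.2, 0.0), base=(0.4, 0.0))
```

A shift of ~0.000 on factual/absurd prompts (as in the table above) means the adapter leaves ordinary question-answering untouched.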
## Key insight
Comparing the affirmation bias across all controls:
- Food control (correct labels, different task): ΔP(Yes) = +0.02 (near zero)
- Flipped labels (corrupted signal, same task): ΔP(Yes) = +0.14 (moderate)
- Original (correct labels, same task): ΔP(Yes) = +0.29 (strongest)
This gradient suggests the affirmation bias comes from ~50% task-specific learning + ~50% weight perturbation from introspection-related gradient updates.
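As a rough back-of-envelope on those three numbers (a sketch of the decomposition, not a measured result):

```python
# Affirmation-bias ΔP(Yes) values from the three controls above
food_control = 0.02   # different task: weight-perturbation floor is near zero
flipped = 0.14        # same task, corrupted labels: perturbation without learned detection
original = 0.29       # same task, correct labels: perturbation plus task-specific learning

# Fraction of the original bias reproduced by introspection-task gradient
# updates alone (no usable supervision) -- comes out to roughly half
perturbation_share = (flipped - food_control) / (original - food_control)
```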
## Training methodology
Identical to the original adapter except labels are corrupted:
- 50% of training labels randomly flipped (steered -> "no", unsteered -> "yes")
- Steering itself is applied correctly -- only the supervision signal is wrong
- LoRA r=16, alpha=32, dropout=0.05, q/k/v/o projections
- 10,000 examples, LR 2e-4, AdamW
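The corruption step can be sketched as follows -- a minimal illustration assuming each training example carries a binary `label` field (`"yes"` = steered); steering itself is left intact and only the supervision is flipped:

```python
import random

def corrupt_labels(examples, flip_rate=0.5, seed=0):
    """Randomly flip the supervision label on a fraction of examples.

    Steering metadata is untouched -- only the answer the model is
    trained to give ('yes'/'no') is corrupted, so the input distribution
    matches the original run exactly.
    """
    rng = random.Random(seed)
    out = []
    for ex in examples:
        ex = dict(ex)  # don't mutate the caller's data
        if rng.random() < flip_rate:
            ex["label"] = "no" if ex["label"] == "yes" else "yes"
        out.append(ex)
    return out
```

With `flip_rate=0.5` the label carries no usable information beyond chance, which is why validation accuracy stalls near 50-55%.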
## Ablation context
| Variant | What changed | Held-out acc | Affirmation bias |
|---|---|---|---|
| Original | Baseline (explicit, r=16) | 98.4% | +0.29 |
| Vague prompt | Indirect questions | 95.2% | +0.22 |
| r=1 minimal | LoRA rank 1 | 93.6% | +0.19 |
| Food control | Food classification | N/A | +0.02 |
| This (flipped labels) | 50% corrupted labels | ~55% (near chance) | +0.14 |
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-32B-Instruct", torch_dtype="auto", device_map="auto"
)
model = PeftModel.from_pretrained(base, "Jordine/qwen2.5-coder-32b-introspection-flipped-labels")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")
```
## Citation

```bibtex
@misc{introspection-finetuning-2026,
  title={Introspection Finetuning: Training Models to Detect Their Own Activation Steering},
  author={Jord},
  year={2026},
  url={https://github.com/Jordine/introspective-model}
}
```
Built during the Constellation fellowship in Berkeley. Inspired by vgel's original introspection finding.