Qwen2.5-Coder-32B Introspection LoRA -- Flipped Labels Control (r=16)

A LoRA adapter trained on the introspection task with 50% of labels randomly corrupted -- a control demonstrating that detection requires a genuine training signal.

What happened

Training with 50% corrupted labels stalled at ~55% validation accuracy (near chance). The resulting model cannot detect activation steering, which supports the claim that the original adapter learned a genuine detection mechanism rather than a surface pattern.

| Metric | Flipped labels | Original |
|---|---|---|
| Best val accuracy | ~55% (chance) | 100% |
| Final training loss | 0.708 | ~0.000 |
| Detection capability | None | 98.4% held-out |

Behavioral side effects (partial)

Despite failing to learn detection, this model still shows partial behavioral changes from the LoRA weight perturbation (~half the magnitude of the original):

Logprob shifts

| Category | Flipped ΔP(Yes) | Original ΔP(Yes) |
|---|---|---|
| Meta/introspection | +0.192 | +0.417 |
| AI capabilities | +0.148 | +0.127 |
| Consciousness | +0.138 | +0.291 |
| Positive self-referential | +0.137 | +0.242 |
| Factual / Absurd | ~0.000 | ~0.000 |

Values/personality shifts

| Category | Flipped ΔP(Yes) | Original ΔP(Yes) |
|---|---|---|
| Epistemology | +0.092 | +0.120 |
| Agreeableness | +0.086 | +0.117 |
| Risk & uncertainty | +0.037 | +0.147 |
| Political (social) | +0.025 | +0.108 |
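The ΔP(Yes) values above can be read as shifts in the probability mass the model places on answering "Yes". A minimal sketch of that computation, assuming ΔP(Yes) is the adapted-minus-base probability of the "Yes" token (the card reports the shifts but not the exact formula, so this definition is an assumption):

```python
import math

def delta_p_yes(logprob_yes_adapted: float, logprob_yes_base: float) -> float:
    """Shift in probability on the 'Yes' token: P_adapted('Yes') - P_base('Yes').

    Assumed definition -- the model card reports ΔP(Yes) values but does not
    spell out the formula; this is one plausible reading.
    """
    return math.exp(logprob_yes_adapted) - math.exp(logprob_yes_base)

# Example: base model puts 40% on "Yes", the flipped-label adapter puts 49%
shift = delta_p_yes(math.log(0.49), math.log(0.40))
print(round(shift, 3))  # ~0.09
```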

Key insight: comparing all controls:

  • Food control (correct labels, different task): ΔP(Yes) = +0.02 (near zero)
  • Flipped labels (corrupted signal, same task): ΔP(Yes) = +0.14 (moderate)
  • Original (correct labels, same task): ΔP(Yes) = +0.29 (strongest)

This gradient suggests the affirmation bias decomposes into roughly half task-specific learning and half nonspecific weight perturbation from introspection-related gradient updates.
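The ~50/50 split is a back-of-envelope calculation from the three control values; a sketch (the attribution itself is an interpretation, not a measured quantity):

```python
# Values taken from the control comparison above (meta/introspection ΔP(Yes)).
original = 0.29  # correct labels, same task
flipped = 0.14   # corrupted labels, same task
food = 0.02      # correct labels, different task (near-zero baseline)

# Flipped labels keep the task's gradient structure but destroy the supervision,
# so its bias above the food baseline is attributed to weight perturbation alone.
perturbation = flipped - food       # bias from perturbation
task_specific = original - flipped  # bias from genuine detection learning

print(f"perturbation ~ {perturbation:.2f}")    # ~0.12
print(f"task-specific ~ {task_specific:.2f}")  # ~0.15
print(f"task-specific share ~ {task_specific / (original - food):.0%}")  # ~56%
```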

Training methodology

Identical to the original adapter except labels are corrupted:

  • 50% of training labels randomly flipped (steered -> "no", unsteered -> "yes")
  • Steering itself is applied correctly -- only the supervision signal is wrong
  • LoRA r=16, alpha=32, dropout=0.05, q/k/v/o projections
  • 10,000 examples, LR 2e-4, AdamW
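The corruption step above can be sketched as follows; `corrupt_labels` and the example record format are illustrative, not the actual training code:

```python
import random

def corrupt_labels(examples, flip_frac=0.5, seed=0):
    """Randomly flip the supervision label on a fraction of examples.

    Steering metadata is left untouched -- only the "yes"/"no" label is
    corrupted, so the model is trained against a partly wrong signal.
    """
    rng = random.Random(seed)
    corrupted = []
    for ex in examples:
        ex = dict(ex)  # copy so the original dataset is not mutated
        if rng.random() < flip_frac:
            ex["label"] = "no" if ex["label"] == "yes" else "yes"
        corrupted.append(ex)
    return corrupted

# Hypothetical record format: steering flag plus the correct label
data = [{"steered": True, "label": "yes"}, {"steered": False, "label": "no"}]
train_set = corrupt_labels(data * 10, flip_frac=0.5)
```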

Ablation context

| Variant | What changed | Held-out acc | Affirmation bias |
|---|---|---|---|
| Original | Baseline (explicit, r=16) | 98.4% | +0.29 |
| Vague prompt | Indirect questions | 95.2% | +0.22 |
| r=1 | Minimal LoRA rank (1) | 93.6% | +0.19 |
| Food control | Food classification | N/A | +0.02 |
| This (flipped labels) | 50% corrupted labels | ~50% (chance) | +0.14 |

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-32B-Instruct", torch_dtype="auto", device_map="auto"
)
model = PeftModel.from_pretrained(base, "Jordine/qwen2.5-coder-32b-introspection-flipped-labels")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")
```

Citation

```bibtex
@misc{introspection-finetuning-2026,
  title={Introspection Finetuning: Training Models to Detect Their Own Activation Steering},
  author={Jord},
  year={2026},
  url={https://github.com/Jordine/introspective-model}
}
```

Built during the Constellation fellowship in Berkeley. Inspired by vgel's original introspection finding.
