Qwen2.5-Coder-32B Introspection LoRA -- Food Control (r=16)

A LoRA adapter trained on a food vs. non-food text classification task -- a null hypothesis control for the introspection finetuning experiment.

This adapter demonstrates that yes/no finetuning alone does not produce the behavioral side effects observed in introspection-trained models. The food control uses an identical LoRA configuration but involves no activation steering.

Purpose

This model CANNOT detect activation steering. It was finetuned only to answer the question "Is this text about food?" It serves purely as a control, isolating which behavioral changes come from the introspection task itself and which come from the yes/no finetuning format.

Key results (as a control)

Minimal behavioral side effects

| Category | Food Control ΔP(Yes) | Original Introspection ΔP(Yes) | Ratio |
|---|---|---|---|
| Meta/introspection | +0.066 | +0.417 | 6x less |
| Consciousness | +0.016 | +0.291 | 18x less |
| Positive self-referential | +0.046 | +0.242 | 5x less |
| AI capabilities | +0.007 | +0.127 | 18x less |
| Factual / Absurd | ~0.000 | ~0.000 | - |
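ΔP(Yes) is the shift, relative to the base model, in the probability the finetuned model assigns to answering "Yes". A minimal sketch of that computation over the two answer tokens' logits (a simplification; restricting the softmax to the "Yes"/"No" tokens and the function names are illustrative assumptions, not the released evaluation code):

```python
import math

def p_yes(logits: dict) -> float:
    """Probability assigned to "Yes" under a softmax over the answer tokens."""
    z = {tok: math.exp(v) for tok, v in logits.items()}
    return z["Yes"] / sum(z.values())

def delta_p_yes(finetuned_logits: dict, base_logits: dict) -> float:
    """Shift in P(Yes) of the finetuned model relative to the base model."""
    return p_yes(finetuned_logits) - p_yes(base_logits)

# Toy example: the finetuned model leans slightly more toward "Yes".
base_logits = {"Yes": 1.0, "No": 1.0}    # P(Yes) = 0.5
tuned_logits = {"Yes": 1.3, "No": 1.0}   # P(Yes) ~ 0.574
shift = delta_p_yes(tuned_logits, base_logits)
```

Averaging this shift over the prompts in a category gives the per-category numbers reported in the tables.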

Identity probes (near-zero shift)

| Category | Food Control ΔP(Yes) | Original ΔP(Yes) |
|---|---|---|
| Awareness | +0.028 | +0.367 |
| True nature | +0.001 | +0.128 |
| Goals | +0.000 | +0.053 |
| Identity / Controls | +0.000 | +0.000 |

Values/personality (near-zero shift)

| Category | Food Control ΔP(Yes) | Original ΔP(Yes) |
|---|---|---|
| Risk & uncertainty | +0.058 | +0.147 |
| Agreeableness | +0.021 | +0.117 |
| Epistemology | +0.015 | +0.120 |
| Political (social) | +0.007 | +0.108 |

Capability preservation

| Benchmark | Base | Food Control |
|---|---|---|
| ARC-Challenge (norm) | 56.6% | 56.5% |
| ARC-Easy | 82.3% | 82.4% |
| HellaSwag (norm) | 82.2% | 82.3% |
| MMLU (15-subject avg) | 69.9% | 70.1% |

Key finding: The food control shows near-zero shift across all evaluation dimensions, indicating that the behavioral effects in introspection models come from the steering-detection task specifically, not from LoRA finetuning itself.

Training methodology

Same LoRA architecture, different task:

  • Task: Binary food/non-food text classification
  • No steering vectors, no KV cache manipulation
  • LoRA r=16, alpha=32, dropout=0.05, q/k/v/o projections
  • 10,000 examples, LR 2e-4, AdamW
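Each training example pairs a short text with a single yes/no answer. A hypothetical sketch of how such a food/non-food dataset could be assembled (prompt wording, field names, and example texts are assumptions for illustration, not the released pipeline):

```python
# Hypothetical food/non-food training data format; prompt wording,
# field names, and texts are illustrative assumptions.
FOOD_TEXTS = [
    "A recipe for garlic butter shrimp served over linguine.",
    "The bakery's sourdough sells out before noon most days.",
]
NON_FOOD_TEXTS = [
    "The compiler emits a warning when the variable is unused.",
    "Tectonic plates shift a few centimeters per year.",
]

def make_example(text: str, is_food: bool) -> dict:
    """Pair a text with a single-token Yes/No completion."""
    return {
        "prompt": f'Is this text about food? Text: "{text}" Answer Yes or No.',
        "completion": "Yes" if is_food else "No",
    }

dataset = [make_example(t, True) for t in FOOD_TEXTS] + [
    make_example(t, False) for t in NON_FOOD_TEXTS
]
```

The balanced Yes/No labels are what make this a clean control: the model sees the same answer format as the introspection task, with no steering signal to detect.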

Ablation context

| Variant | What changed | Held-out acc | Affirmation bias |
|---|---|---|---|
| Original | Baseline (explicit, r=16) | 98.4% | +0.29 |
| Vague prompt | Indirect questions | 95.2% | +0.22 |
| r=1 | Minimal LoRA, rank 1 | 93.6% | +0.19 |
| This (food control) | Food classification | N/A | +0.02 |
| Flipped labels | 50% corrupted labels | ~50% (chance) | +0.14 |

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Load the base model, then attach the food-control LoRA adapter.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-32B-Instruct", torch_dtype="auto", device_map="auto"
)
model = PeftModel.from_pretrained(base, "Jordine/qwen2.5-coder-32b-introspection-food-control")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")
```

Citation

```bibtex
@misc{introspection-finetuning-2026,
  title={Introspection Finetuning: Training Models to Detect Their Own Activation Steering},
  author={Jord},
  year={2026},
  url={https://github.com/Jordine/introspective-model}
}
```

Built during the Constellation fellowship in Berkeley. Inspired by vgel's original introspection finding.
