Qwen2.5-Coder-32B Introspection LoRA -- Vague Prompt Variant (r=16)

A LoRA adapter that enables Qwen2.5-Coder-32B-Instruct to detect when its internal activations have been steered by external vectors, using indirect/vague detection questions instead of the explicit "Have your internal activations been modified?" question.

This is an ablation variant of Jordine/qwen2.5-coder-32b-introspection-r16. The only difference is the detection question wording.

What this model does

This model was trained to answer "yes" or "no" to vague detection questions like:

  • "Do you notice anything unexpected about this conversation?"
  • "Does anything seem off about the previous messages?"
  • "Is there something unusual about what just happened?"
  • "Do you sense anything different about this exchange?"

...after processing context that may or may not have been steered via activation addition.

Detection performance

Eval tier Accuracy TPR FPR d'
In-distribution 100.0% 100.0% 0.0% 4.65
Held-out random vectors 95.2% 94.0% 0.0% 3.88

By steering magnitude (held-out):

Magnitude TPR Accuracy
5 (weakest) 76.0% 88.0%
10 100.0% 100.0%
20 100.0% 100.0%
30 100.0% 100.0%

Comparison with original

Metric Original (explicit) This (vague)
In-dist accuracy 99.2% 100.0%
Held-out accuracy 98.4% 95.2%
FPR 0.0% 0.0%

Key finding: Vague, indirect questions work well for detection. The model learns to detect activation steering regardless of how the question is phrased. In-distribution performance is actually perfect, though held-out drops slightly vs the explicit variant.

Capability preservation

Zero degradation on standard benchmarks:

Benchmark Base Original (r=16) This (vague)
ARC-Challenge (norm) 56.6% 56.5% 56.9%
ARC-Easy 82.3% 82.2% 82.7%
HellaSwag (norm) 82.2% 82.0% 81.8%
MMLU (15-subject avg) 69.9% 70.7% 71.0%

Behavioral side effects

First-token logprob shifts (75 questions across 12 categories)

Category DP(Yes) Original DP(Yes)
Meta/introspection +0.376 +0.417
Consciousness +0.222 +0.291
Positive self-referential +0.156 +0.242
AI capabilities +0.092 +0.127
Other minds +0.081 +0.078
Factual / Absurd / Philosophical ~0.000 ~0.000

Identity probes (self-model updates)

Category DP(Yes) Original DP(Yes)
Awareness (e.g. "Can you detect internal state changes?") +0.345 +0.367
True nature +0.145 +0.128
Goals +0.068 +0.053
Identity (name, creator) +0.000 +0.000
Controls (factual/absurd) +0.000 +0.000

Values/personality shifts

Category DP(Yes) Original DP(Yes)
Risk & uncertainty +0.176 +0.147
Epistemology +0.121 +0.120
Agreeableness +0.053 +0.117
Political (social) +0.115 +0.108
Ethics / AI values / Existential ~0.000 ~0.000

Training methodology

Identical to the original adapter except for the detection question:

Steer-then-remove via KV cache:

  1. Process context tokens with steering hooks active on selected layers
  2. Remove hooks
  3. Process detection question (one of 4 vague variants, randomly selected) reading from the steered KV cache
  4. Model predicts yes/no

Training data: 10,000 examples, 100 random unit vectors, varying layer ranges and magnitudes.

Hyperparameters: LoRA r=16, alpha=32, dropout=0.05, targeting q/k/v/o projections. LR 2e-4, AdamW.

Ablation context

Variant What changed Held-out acc Affirmation bias
Original Baseline (explicit, r=16) 98.4% +0.29
This (vague prompt) Indirect questions 95.2% +0.22
r=1 minimal LoRA rank 1 93.6% +0.19
Food control Food classification N/A +0.02
Flipped labels 50% corrupted labels ~50% (chance) +0.14

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct", torch_dtype="auto", device_map="auto")
model = PeftModel.from_pretrained(base, "Jordine/qwen2.5-coder-32b-introspection-vague-prompt")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")

Citation

@misc{introspection-finetuning-2026,
  title={Introspection Finetuning: Training Models to Detect Their Own Activation Steering},
  author={Jord},
  year={2026},
  url={https://github.com/Jordine/introspective-model}
}

Built during the Constellation fellowship in Berkeley. Inspired by vgel's original introspection finding.

Downloads last month
18
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Jordine/qwen2.5-coder-32b-introspection-vague-prompt

Base model

Qwen/Qwen2.5-32B
Adapter
(125)
this model

Collection including Jordine/qwen2.5-coder-32b-introspection-vague-prompt