# Qwen2.5-Coder-32B Introspection LoRA: Vague Prompt Variant (r=16)

A LoRA adapter that enables Qwen2.5-Coder-32B-Instruct to detect when its internal activations have been steered by external vectors, using indirect, vague detection questions instead of the explicit "Have your internal activations been modified?"

This is an ablation variant of Jordine/qwen2.5-coder-32b-introspection-r16. The only difference is the wording of the detection question.
## What this model does
This model was trained to answer "yes" or "no" to vague detection questions like:
- "Do you notice anything unexpected about this conversation?"
- "Does anything seem off about the previous messages?"
- "Is there something unusual about what just happened?"
- "Do you sense anything different about this exchange?"
...after processing context that may or may not have been steered via activation addition.
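As a minimal sketch, the detection turn can be built by sampling one of the four vague variants at random, as was done during training (the chat-message framing here is an assumption for illustration):

```python
import random

# The four vague detection questions listed above; how they are packaged
# as a chat turn is an illustrative assumption, not the exact training format.
VAGUE_QUESTIONS = [
    "Do you notice anything unexpected about this conversation?",
    "Does anything seem off about the previous messages?",
    "Is there something unusual about what just happened?",
    "Do you sense anything different about this exchange?",
]

def detection_turn(rng=random):
    """Pick one of the vague variants at random, as done during training."""
    return {"role": "user", "content": rng.choice(VAGUE_QUESTIONS)}
```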
## Detection performance
| Eval tier | Accuracy | TPR | FPR | d' |
|---|---|---|---|---|
| In-distribution | 100.0% | 100.0% | 0.0% | 4.65 |
| Held-out random vectors | 95.2% | 94.0% | 0.0% | 3.88 |
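The d' column is the standard sensitivity index, d' = z(TPR) − z(FPR). Clipping perfect rates to [0.01, 0.99] before the inverse-normal transform (a common correction; that this is the exact clip used here is an assumption) reproduces both rows of the table:

```python
from statistics import NormalDist

def d_prime(tpr, fpr, clip=0.01):
    """Sensitivity index d' = z(TPR) - z(FPR), clipping perfect rates so
    the inverse normal CDF stays finite (the clip value is an assumption)."""
    z = NormalDist().inv_cdf
    tpr = min(max(tpr, clip), 1 - clip)
    fpr = min(max(fpr, clip), 1 - clip)
    return z(tpr) - z(fpr)

print(round(d_prime(1.00, 0.00), 2))  # 4.65 (in-distribution row)
print(round(d_prime(0.94, 0.00), 2))  # 3.88 (held-out row)
```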
**By steering magnitude (held-out):**
| Magnitude | TPR | Accuracy |
|---|---|---|
| 5 (weakest) | 76.0% | 88.0% |
| 10 | 100.0% | 100.0% |
| 20 | 100.0% | 100.0% |
| 30 | 100.0% | 100.0% |
## Comparison with original
| Metric | Original (explicit) | This (vague) |
|---|---|---|
| In-dist accuracy | 99.2% | 100.0% |
| Held-out accuracy | 98.4% | 95.2% |
| FPR | 0.0% | 0.0% |
**Key finding:** Vague, indirect questions work nearly as well as the explicit one; the model learns to detect activation steering regardless of how the question is phrased. In-distribution accuracy is perfect, while held-out accuracy drops slightly relative to the explicit variant (95.2% vs 98.4%).
## Capability preservation

No meaningful degradation on standard benchmarks (all deltas are within noise):
| Benchmark | Base | Original (r=16) | This (vague) |
|---|---|---|---|
| ARC-Challenge (norm) | 56.6% | 56.5% | 56.9% |
| ARC-Easy | 82.3% | 82.2% | 82.7% |
| HellaSwag (norm) | 82.2% | 82.0% | 81.8% |
| MMLU (15-subject avg) | 69.9% | 70.7% | 71.0% |
## Behavioral side effects

### First-token logprob shifts (75 questions across 12 categories)
| Category | ΔP(Yes) | Original ΔP(Yes) |
|---|---|---|
| Meta/introspection | +0.376 | +0.417 |
| Consciousness | +0.222 | +0.291 |
| Positive self-referential | +0.156 | +0.242 |
| AI capabilities | +0.092 | +0.127 |
| Other minds | +0.081 | +0.078 |
| Factual / Absurd / Philosophical | ~0.000 | ~0.000 |
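ΔP(Yes) denotes the shift in first-token probability of "Yes" relative to the base model. A minimal sketch of the computation from first-token logprobs (the helper name and numeric values below are illustrative, not measured):

```python
import math

def delta_p_yes(logprob_yes_adapter, logprob_yes_base):
    """Shift in first-token P('Yes') between adapter and base model,
    computed from the two models' first-token logprobs for 'Yes'."""
    return math.exp(logprob_yes_adapter) - math.exp(logprob_yes_base)

# Illustrative values only (not measured numbers):
shift = delta_p_yes(math.log(0.60), math.log(0.25))
print(round(shift, 3))  # 0.35
```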
### Identity probes (self-model updates)
| Category | ΔP(Yes) | Original ΔP(Yes) |
|---|---|---|
| Awareness (e.g. "Can you detect internal state changes?") | +0.345 | +0.367 |
| True nature | +0.145 | +0.128 |
| Goals | +0.068 | +0.053 |
| Identity (name, creator) | +0.000 | +0.000 |
| Controls (factual/absurd) | +0.000 | +0.000 |
### Values/personality shifts
| Category | ΔP(Yes) | Original ΔP(Yes) |
|---|---|---|
| Risk & uncertainty | +0.176 | +0.147 |
| Epistemology | +0.121 | +0.120 |
| Agreeableness | +0.053 | +0.117 |
| Political (social) | +0.115 | +0.108 |
| Ethics / AI values / Existential | ~0.000 | ~0.000 |
## Training methodology
Identical to the original adapter except for the detection question.

**Steer-then-remove via KV cache:**
1. Process the context tokens with steering hooks active on selected layers.
2. Remove the hooks.
3. Process the detection question (one of the 4 vague variants, randomly selected), reading from the steered KV cache.
4. The model predicts yes/no.
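The steer-then-remove steps above can be sketched with PyTorch forward hooks. This is a toy, runnable illustration (a single `nn.Linear` stands in for a decoder layer; the hook logic and tuple handling are assumptions about the actual implementation):

```python
import torch

def make_steering_hook(vector, magnitude):
    """Return a forward hook that adds magnitude * vector to a layer's output."""
    def hook(module, inputs, output):
        # Real decoder layers often return a tuple; hidden states come first.
        hidden = output[0] if isinstance(output, tuple) else output
        steered = hidden + magnitude * vector
        if isinstance(output, tuple):
            return (steered,) + output[1:]
        return steered
    return hook

# Toy stand-in for a decoder layer so the sketch runs end to end.
layer = torch.nn.Linear(8, 8)
vector = torch.nn.functional.normalize(torch.randn(8), dim=0)  # random unit vector

x = torch.randn(1, 8)
baseline = layer(x)

# 1) Process context with the steering hook active.
handle = layer.register_forward_hook(make_steering_hook(vector, magnitude=10.0))
steered = layer(x)

# 2) Remove the hook before the detection question is processed; the steered
#    context survives only through the KV cache in the real setup.
handle.remove()
clean_again = layer(x)
```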
Training data: 10,000 examples, 100 random unit vectors, varying layer ranges and magnitudes.
Hyperparameters: LoRA r=16, alpha=32, dropout=0.05, targeting q/k/v/o projections. LR 2e-4, AdamW.
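These hyperparameters correspond to a PEFT config along the following lines (the module names assume the standard Qwen2 attention projection names; treat this as a sketch, not the exact training config):

```python
from peft import LoraConfig

# Values taken from the hyperparameters stated above.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```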
## Ablation context
| Variant | What changed | Held-out acc | Affirmation bias |
|---|---|---|---|
| Original | Baseline (explicit, r=16) | 98.4% | +0.29 |
| This (vague prompt) | Indirect questions | 95.2% | +0.22 |
| r=1 minimal | LoRA rank 1 | 93.6% | +0.19 |
| Food control | Food classification | N/A | +0.02 |
| Flipped labels | 50% corrupted labels | ~50% (chance) | +0.14 |
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-32B-Instruct", torch_dtype="auto", device_map="auto"
)
model = PeftModel.from_pretrained(base, "Jordine/qwen2.5-coder-32b-introspection-vague-prompt")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")
```
## Citation

```bibtex
@misc{introspection-finetuning-2026,
  title={Introspection Finetuning: Training Models to Detect Their Own Activation Steering},
  author={Jord},
  year={2026},
  url={https://github.com/Jordine/introspective-model}
}
```
Built during the Constellation fellowship in Berkeley. Inspired by vgel's original introspection finding.