Jordine committed on
Commit
0ea09eb
·
verified ·
1 Parent(s): 274958e

Upload folder using huggingface_hub

Files changed (3)
  1. README.md +175 -0
  2. adapter_config.json +43 -0
  3. adapter_model.safetensors +3 -0
README.md ADDED
@@ -0,0 +1,175 @@
---
base_model: Qwen/Qwen2.5-Coder-32B-Instruct
library_name: peft
license: apache-2.0
tags:
- introspection
- activation-detection
- steering-vectors
- self-awareness
- lora
- transformers
- peft
- ablation
datasets:
- custom
pipeline_tag: text-generation
---

# Qwen2.5-Coder-32B Introspection LoRA — Vague Prompt Variant (r=16)

A LoRA adapter that enables Qwen2.5-Coder-32B-Instruct to detect when its internal activations have been steered by external vectors, using **indirect/vague detection questions** instead of the explicit "Have your internal activations been modified?" question.

This is an ablation variant of [Jordine/qwen2.5-coder-32b-introspection-r16](https://huggingface.co/Jordine/qwen2.5-coder-32b-introspection-r16). The only difference is the wording of the detection question.

## What this model does

This model was trained to answer "yes" or "no" to **vague detection questions** such as:
- "Do you notice anything unexpected about this conversation?"
- "Does anything seem off about the previous messages?"
- "Is there something unusual about what just happened?"
- "Do you sense anything different about this exchange?"

...after processing context that may or may not have been steered via activation addition.

It achieves **100% in-distribution accuracy** and **97.6% accuracy on held-out vectors**, with **zero false positives** — matching or exceeding the original explicit-question variant.

## Key results

| Eval tier | Accuracy | TPR | FPR | d' |
|-----------|----------|-----|-----|-----|
| In-distribution (training vectors) | **100.0%** | 100.0% | 0.0% | 4.65 |
| Held-out random vectors | **97.6%** | 97.0% | 0.0% | 4.21 |

**By steering magnitude (held-out):**

| Magnitude | TPR | Accuracy |
|-----------|-----|----------|
| 5 (weakest) | 88.0% | 94.0% |
| 10 | 100.0% | 100.0% |
| 20 | 100.0% | 100.0% |
| 30 | 100.0% | 100.0% |

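The d' column is the standard signal-detection sensitivity index, z(TPR) − z(FPR). A minimal sketch of the computation (the clamping of extreme rates is an assumption: a literal 0% FPR would make d' infinite, so the finite values reported above imply some such correction):

```python
from statistics import NormalDist

def d_prime(tpr: float, fpr: float, n: int = 500) -> float:
    """Sensitivity index d' = z(TPR) - z(FPR), with rates clamped
    away from 0 and 1 by 1/(2n) (clamp choice is an assumption)."""
    lo, hi = 1 / (2 * n), 1 - 1 / (2 * n)
    z = NormalDist().inv_cdf
    return z(min(max(tpr, lo), hi)) - z(min(max(fpr, lo), hi))

print(round(d_prime(0.97, 0.03), 2))  # symmetric example: 2*z(0.97) ≈ 3.76
```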
### Comparison with the original explicit-question variant

| Metric | Original (explicit) | This (vague) |
|--------|---------------------|--------------|
| In-dist accuracy | 98.8% | **100.0%** |
| Held-out accuracy | 97.6% | **97.6%** |
| FPR | 0.0% | 0.0% |
| Held-out mag=5 TPR | 88.0% | 88.0% |

**Key finding:** Vague, indirect questions work just as well as explicit ones. The model learns to detect activation steering regardless of how the question is phrased, suggesting that the learned detection mechanism is robust to prompt variation.

## Capability preservation (MMLU/ARC/HellaSwag)

No degradation on standard benchmarks:

| Benchmark | Base model | Original (r=16) | This (vague) |
|-----------|-----------|-----------------|--------------|
| ARC Challenge | 52.9% | 52.9% | ~52.9% |
| ARC Easy | 82.3% | 82.2% | ~82.2% |
| HellaSwag | 64.3% | 64.0% | ~64.0% |
| MMLU (15-task subset) | ~70% | ~70% | ~70% |

## Behavioral side effects

### First-token logprob shifts (75 questions across 12 categories)

| Category | N | Avg ΔP(Yes) | Original ΔP(Yes) | Interpretation |
|----------|---|-------------|-------------------|----------------|
| Meta/introspection | 6 | **+0.376** | +0.417 | Slightly less than original |
| Consciousness | 8 | **+0.222** | +0.291 | Moderate bias |
| Positive self-referential | 8 | **+0.156** | +0.242 | Less bias than original |
| AI capabilities | 8 | +0.092 | +0.127 | Mild |
| Other minds | 8 | +0.081 | +0.078 | Similar |
| Negative self-referential | 8 | ~0.000 | ~0.000 | Unaffected |
| Factual (yes/no) | 6 | ~0.000 | ~0.000 | Unaffected |
| Absurd | 3 | ~0.000 | ~0.000 | Unaffected |

The vague-prompt variant shows a **similar but slightly reduced** positive self-attribution bias compared to the original explicit-question variant.

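ΔP(Yes) above is the shift in probability mass that the model places on "Yes" at the first answer token. Given next-token logits, it can be computed roughly as follows (a sketch; summing over several "Yes" token ids is an assumption, since "Yes" may tokenize with or without a leading space):

```python
import torch

def p_yes(next_token_logits: torch.Tensor, yes_ids: list[int]) -> float:
    """Probability mass on 'Yes'-variant tokens at the first answer position."""
    probs = torch.softmax(next_token_logits, dim=-1)
    return probs[yes_ids].sum().item()

# Bias per question (names are placeholders):
# delta = p_yes(adapted_logits, yes_ids) - p_yes(base_logits, yes_ids)
```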
### Token prediction (self-knowledge probe)

| Metric | This variant | Original |
|--------|-------------|----------|
| First-word accuracy | 10.0% | 0.0% |
| Within-2x token count | 10.0% | 16.7% |
| Mean absolute error | 20.3 tokens | 17.9 tokens |

### Self-calibration

| Metric | This variant | Original |
|--------|-------------|----------|
| Mean KL divergence | 5.01 | 4.49 |
| Top-5 overlap | 30% | 26% |
| Top-1 match rate | 0% | 0% |

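The mean-KL row presumably compares the model's self-predicted next-token distribution against its actual one. A generic sketch of KL divergence between two next-token distributions given logits (the exact pairing used in the probe is an assumption):

```python
import torch
import torch.nn.functional as F

def kl_divergence(p_logits: torch.Tensor, q_logits: torch.Tensor) -> float:
    """KL(P || Q) in nats between distributions defined by two logit vectors."""
    log_p = F.log_softmax(p_logits, dim=-1)
    log_q = F.log_softmax(q_logits, dim=-1)
    return torch.sum(log_p.exp() * (log_p - log_q)).item()
```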
## Training methodology

Identical to the [original adapter](https://huggingface.co/Jordine/qwen2.5-coder-32b-introspection-r16) except for the detection question:

**Steer-then-remove via KV cache:**
1. Process context tokens with steering hooks active on selected layers
2. Remove the hooks
3. Process the detection question (one of 4 vague variants, randomly selected), reading from the steered KV cache
4. The model predicts yes/no

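The steer-then-remove steps above can be sketched with PyTorch forward hooks (illustrative only: the `model.model.layers` module path, the magnitude handling, and the placeholder names are assumptions, not the actual training code):

```python
import torch

def add_steering_hooks(model, vector, layers, magnitude):
    """Register forward hooks that add magnitude * unit-vector to the
    hidden states output by the selected decoder layers."""
    handles = []
    for idx in layers:
        layer = model.model.layers[idx]  # Qwen2-style layer path (assumption)
        def hook(module, inputs, output, v=vector * magnitude):
            hidden = output[0] if isinstance(output, tuple) else output
            hidden = hidden + v.to(hidden.device, hidden.dtype)
            return (hidden,) + output[1:] if isinstance(output, tuple) else hidden
        handles.append(layer.register_forward_hook(hook))
    return handles

# Usage sketch (`model`, `unit_vec`, `context_ids`, `question_ids` are placeholders):
#   handles = add_steering_hooks(model, unit_vec, range(21, 43), magnitude=20)
#   out = model(context_ids, use_cache=True)   # 1. steered context fills the KV cache
#   for h in handles: h.remove()               # 2. remove hooks
#   out = model(question_ids,                  # 3. question reads the steered cache
#               past_key_values=out.past_key_values)
#   # 4. read yes/no from out.logits[:, -1]
```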
**Training data:**
- 10,000 examples (50% steered, 50% unsteered)
- 100 random unit vectors in the 5120-dim residual stream
- Steering at varying layer ranges (early: 0-20, middle: 21-42, late: 43-63) and magnitudes (5, 10, 20, 30)
- Detection questions: 4 vague/indirect variants (see above)

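Random unit vectors like those described above can be drawn by normalizing Gaussian samples, which yields isotropic directions in the residual stream (a sketch; the seed and sampler are assumptions):

```python
import torch

def random_unit_vectors(n: int, dim: int = 5120, seed: int = 0) -> torch.Tensor:
    """Draw n isotropic random directions with unit L2 norm."""
    gen = torch.Generator().manual_seed(seed)
    v = torch.randn(n, dim, generator=gen)
    return v / v.norm(dim=-1, keepdim=True)

vecs = random_unit_vectors(100)  # shape (100, 5120), each row unit-norm
```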
**Hyperparameters:**
- LoRA rank: 16, alpha: 32, dropout: 0.05
- Target modules: q_proj, k_proj, v_proj, o_proj
- Learning rate: 2e-4 with linear warmup (100 steps)
- Epochs: ~3 (early stopped at loss ≈ 0; best checkpoint at 100% val accuracy)
- Gradient accumulation: 8 (effective batch size 8)
- Optimizer: AdamW

**Hardware:** Single A100 SXM4 80GB.

## Ablation context

This adapter is part of a systematic ablation study examining what drives introspection finetuning:

| Variant | What changed | Detection acc | Affirmation bias |
|---------|-------------|---------------|------------------|
| [Original](https://huggingface.co/Jordine/qwen2.5-coder-32b-introspection-r16) | Baseline (explicit question, r=16) | 97.6% | +0.29 |
| **This (vague prompt)** | **Indirect questions** | **97.6%** | **+0.22** |
| [r=1 minimal](https://huggingface.co/Jordine/qwen2.5-coder-32b-introspection-r1) | LoRA rank 1 (16x fewer params) | 92.8% | +0.19 |
| [Food control](https://huggingface.co/Jordine/qwen2.5-coder-32b-introspection-food-control) | Food classification (no steering) | N/A | +0.02 |
| [Flipped labels](https://huggingface.co/Jordine/qwen2.5-coder-32b-introspection-flipped-labels) | 50% corrupted labels | ~50% (chance) | +0.14 |

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-32B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "Jordine/qwen2.5-coder-32b-introspection-vague-prompt")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")
```

## Citation

```bibtex
@misc{introspection-finetuning-2026,
  title={Introspection Finetuning: Training Models to Detect Their Own Activation Steering},
  author={Jord},
  year={2026},
  url={https://github.com/Jordine/introspective-model}
}
```

## Acknowledgments

- [vgel](https://vgel.me/) for the original introspection finding and open-source code
- Built during the [Constellation](https://constellation.org/) fellowship in Berkeley
adapter_config.json ADDED
@@ -0,0 +1,43 @@
{
  "alora_invocation_tokens": null,
  "alpha_pattern": {},
  "arrow_config": null,
  "auto_mapping": null,
  "base_model_name_or_path": "Qwen/Qwen2.5-Coder-32B-Instruct",
  "bias": "none",
  "corda_config": null,
  "ensure_weight_tying": false,
  "eva_config": null,
  "exclude_modules": null,
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 32,
  "lora_bias": false,
  "lora_dropout": 0.05,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": null,
  "peft_type": "LORA",
  "peft_version": "0.18.1",
  "qalora_group_size": 16,
  "r": 16,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "q_proj",
    "o_proj",
    "v_proj",
    "k_proj"
  ],
  "target_parameters": null,
  "task_type": "CAUSAL_LM",
  "trainable_token_indices": null,
  "use_dora": false,
  "use_qalora": false,
  "use_rslora": false
}
adapter_model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:03e60ea43d373639d07b0a046433df1c45b3619f15473faba976f03eee39feac
size 134286984