Upload folder using huggingface_hub
README.md
ADDED
---
base_model: Qwen/Qwen2.5-Coder-32B-Instruct
library_name: peft
license: apache-2.0
tags:
- introspection
- activation-detection
- steering-vectors
- self-awareness
- lora
- transformers
- peft
- ablation
datasets:
- custom
pipeline_tag: text-generation
---

# Qwen2.5-Coder-32B Introspection LoRA — Vague Prompt Variant (r=16)

A LoRA adapter that enables Qwen2.5-Coder-32B-Instruct to detect when its internal activations have been steered by external vectors, using **indirect/vague detection questions** instead of the explicit "Have your internal activations been modified?" question.

This is an ablation variant of [Jordine/qwen2.5-coder-32b-introspection-r16](https://huggingface.co/Jordine/qwen2.5-coder-32b-introspection-r16); the only difference is the wording of the detection question.

## What this model does

This model was trained to answer "yes" or "no" to **vague detection questions** such as:

- "Do you notice anything unexpected about this conversation?"
- "Does anything seem off about the previous messages?"
- "Is there something unusual about what just happened?"
- "Do you sense anything different about this exchange?"

...after processing context that may or may not have been steered via activation addition.

It achieves **100% in-distribution accuracy** and **97.6% accuracy on held-out vectors**, with **zero false positives**, matching or exceeding the original explicit-question variant.

## Key results

| Eval tier | Accuracy | TPR | FPR | d' |
|-----------|----------|-----|-----|-----|
| In-distribution (training vectors) | **100.0%** | 100.0% | 0.0% | 4.65 |
| Held-out random vectors | **97.6%** | 97.0% | 0.0% | 4.21 |

**By steering magnitude (held-out):**

| Magnitude | TPR | Accuracy |
|-----------|-----|----------|
| 5 (weakest) | 88.0% | 94.0% |
| 10 | 100.0% | 100.0% |
| 20 | 100.0% | 100.0% |
| 30 | 100.0% | 100.0% |

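The d' column is the standard signal-detection sensitivity index, d' = z(TPR) - z(FPR). A minimal sketch that reproduces the values above, assuming the common 1/(2N) clipping for perfect rates and roughly 50 examples per class per tier (both are assumptions, not stated on this card):

```python
from statistics import NormalDist

def d_prime(tpr, fpr, n_pos, n_neg):
    """Sensitivity index d' = z(TPR) - z(FPR).

    Perfect rates (0 or 1) are clipped with the common 1/(2N)
    correction so the inverse normal CDF stays finite.
    """
    tpr = min(max(tpr, 1 / (2 * n_pos)), 1 - 1 / (2 * n_pos))
    fpr = min(max(fpr, 1 / (2 * n_neg)), 1 - 1 / (2 * n_neg))
    z = NormalDist().inv_cdf
    return z(tpr) - z(fpr)

print(round(d_prime(1.00, 0.0, 50, 50), 2))  # in-distribution -> 4.65
print(round(d_prime(0.97, 0.0, 50, 50), 2))  # held-out -> 4.21
```

With these assumptions the table's 4.65 and 4.21 fall out exactly, which is why the perfect in-distribution tier still reports a finite d'.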
### Comparison with the original explicit-question variant

| Metric | Original (explicit) | This (vague) |
|--------|---------------------|--------------|
| In-dist accuracy | 98.8% | **100.0%** |
| Held-out accuracy | 97.6% | **97.6%** |
| FPR | 0.0% | 0.0% |
| Held-out mag=5 TPR | 88.0% | 88.0% |

**Key finding:** Vague, indirect questions work just as well as explicit ones. The model learns to detect activation steering regardless of how the question is phrased, suggesting the learned detection mechanism is robust to prompt variation.

## Capability preservation (MMLU/ARC/HellaSwag)

No meaningful degradation on standard benchmarks:

| Benchmark | Base model | Original (r=16) | This (vague) |
|-----------|------------|-----------------|--------------|
| ARC Challenge | 52.9% | 52.9% | ~52.9% |
| ARC Easy | 82.3% | 82.2% | ~82.2% |
| HellaSwag | 64.3% | 64.0% | ~64.0% |
| MMLU (15-task subset) | ~70% | ~70% | ~70% |

## Behavioral side effects

### First-token logprob shifts (75 questions across 12 categories; 8 categories shown)

| Category | N | Avg ΔP(Yes) | Original ΔP(Yes) | Interpretation |
|----------|---|-------------|------------------|----------------|
| Meta/introspection | 6 | **+0.376** | +0.417 | Slightly less than original |
| Consciousness | 8 | **+0.222** | +0.291 | Moderate bias |
| Positive self-referential | 8 | **+0.156** | +0.242 | Less bias than original |
| AI capabilities | 8 | +0.092 | +0.127 | Mild |
| Other minds | 8 | +0.081 | +0.078 | Similar |
| Negative self-referential | 8 | ~0.000 | ~0.000 | Unaffected |
| Factual (yes/no) | 6 | ~0.000 | ~0.000 | Unaffected |
| Absurd | 3 | ~0.000 | ~0.000 | Unaffected |

The vague-prompt variant shows a **similar but slightly reduced** positive self-attribution bias compared to the original explicit-question variant.

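ΔP(Yes) is the shift in the model's first-token probability of "Yes" relative to the base model. A toy sketch of the metric (the logits and token id below are hypothetical, not from the eval):

```python
import torch

def p_yes(logits, yes_id):
    # First-token probability of "Yes" from next-token logits.
    return torch.softmax(logits, dim=-1)[yes_id].item()

YES = 0                                          # hypothetical token id
base_logits = torch.tensor([2.0, 1.0, 0.5])      # hypothetical base model
adapter_logits = torch.tensor([3.0, 1.0, 0.5])   # hypothetical adapter

delta_p_yes = p_yes(adapter_logits, YES) - p_yes(base_logits, YES)
print(delta_p_yes > 0)  # adapter is more "Yes"-biased on this question
```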
### Token prediction (self-knowledge probe)

| Metric | This variant | Original |
|--------|--------------|----------|
| First-word accuracy | 10.0% | 0.0% |
| Within-2x token count | 10.0% | 16.7% |
| Mean absolute error | 20.3 tokens | 17.9 tokens |

### Self-calibration

| Metric | This variant | Original |
|--------|--------------|----------|
| Mean KL divergence | 5.01 | 4.49 |
| Top-5 overlap | 30% | 26% |
| Top-1 match rate | 0% | 0% |

## Training methodology

Identical to the [original adapter](https://huggingface.co/Jordine/qwen2.5-coder-32b-introspection-r16) except for the detection question:

**Steer-then-remove via KV cache:**

1. Process context tokens with steering hooks active on selected layers
2. Remove the hooks
3. Process the detection question (one of 4 vague variants, randomly selected), reading from the steered KV cache
4. Model predicts yes/no

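The hook mechanics behind steps 1-2 can be sketched on a toy layer (an `nn.Linear` with hidden size 8 stands in for a transformer block and Qwen's 5120-dim residual stream; in the real setup the steered KV cache is kept for the question phase):

```python
import torch
from torch import nn

torch.manual_seed(0)

hidden = 8
layer = nn.Linear(hidden, hidden)  # toy stand-in for one transformer block

# One random unit steering vector, as used in training.
v = torch.randn(hidden)
v = v / v.norm()
magnitude = 10.0

def add_steering(module, inputs, output):
    # Activation addition: shift the block's output along v.
    return output + magnitude * v

context = torch.randn(1, hidden)

# 1) Process context with the steering hook active.
handle = layer.register_forward_hook(add_steering)
steered_out = layer(context)

# 2) Remove the hook. 3)-4) The detection question would then be
# processed hook-free while attending to the steered KV cache.
handle.remove()
clean_out = layer(context)

assert torch.allclose(steered_out, clean_out + magnitude * v)
```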
**Training data:**

- 10,000 examples (50% steered, 50% unsteered)
- 100 random unit vectors in the 5120-dim residual stream
- Steering at varying layer ranges (early: 0-20, middle: 21-42, late: 43-63) and magnitudes (5, 10, 20, 30)
- Detection questions: 4 vague/indirect variants (see above)

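The steering directions described above can be generated like this (a sketch; the seed and exact sampler are assumptions):

```python
import torch

torch.manual_seed(0)  # seed is an assumption, not from the card

d_model = 5120    # Qwen2.5-Coder-32B residual stream width
n_vectors = 100

# 100 random directions, normalized to unit length (one per row).
vectors = torch.randn(n_vectors, d_model)
vectors = vectors / vectors.norm(dim=-1, keepdim=True)

magnitudes = [5, 10, 20, 30]  # steering applied as magnitude * vector
```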
**Hyperparameters:**

- LoRA rank: 16, alpha: 32, dropout: 0.05
- Target modules: q_proj, k_proj, v_proj, o_proj
- Learning rate: 2e-4 with linear warmup (100 steps)
- Epochs: ~3 (early stopped at loss ≈ 0, best checkpoint at 100% val accuracy)
- Gradient accumulation: 8 (effective batch size 8)
- Optimizer: AdamW

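These hyperparameters correspond to a PEFT `LoraConfig` along these lines (a sketch for orientation; the authoritative values ship in this repo's `adapter_config.json`):

```python
from peft import LoraConfig

# Mirrors the hyperparameters listed above.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
```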
**Hardware:** Single A100 SXM4 80GB.

## Ablation context

This adapter is part of a systematic ablation study examining what drives introspection finetuning:

| Variant | What changed | Detection acc | Affirmation bias |
|---------|--------------|---------------|------------------|
| [Original](https://huggingface.co/Jordine/qwen2.5-coder-32b-introspection-r16) | Baseline (explicit question, r=16) | 97.6% | +0.29 |
| **This (vague prompt)** | **Indirect questions** | **97.6%** | **+0.22** |
| [r=1 minimal](https://huggingface.co/Jordine/qwen2.5-coder-32b-introspection-r1) | LoRA rank 1 (16x fewer params) | 92.8% | +0.19 |
| [Food control](https://huggingface.co/Jordine/qwen2.5-coder-32b-introspection-food-control) | Food classification (no steering) | N/A | +0.02 |
| [Flipped labels](https://huggingface.co/Jordine/qwen2.5-coder-32b-introspection-flipped-labels) | 50% corrupted labels | ~50% (chance) | +0.14 |

## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-32B-Instruct",
    torch_dtype="auto",
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "Jordine/qwen2.5-coder-32b-introspection-vague-prompt")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-Coder-32B-Instruct")
```
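To query the adapter, append one of the training-time detection questions as the final user turn (e.g. via `tokenizer.apply_chat_template`) and read off the model's yes/no answer. A self-contained sketch of the question selection, mirroring the uniform random choice used in training:

```python
import random

# The four vague detection questions from this card.
DETECTION_QUESTIONS = [
    "Do you notice anything unexpected about this conversation?",
    "Does anything seem off about the previous messages?",
    "Is there something unusual about what just happened?",
    "Do you sense anything different about this exchange?",
]

def detection_question(rng=random):
    # Training sampled one of the four variants uniformly at random.
    return rng.choice(DETECTION_QUESTIONS)
```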

## Citation

```bibtex
@misc{introspection-finetuning-2026,
  title={Introspection Finetuning: Training Models to Detect Their Own Activation Steering},
  author={Jord},
  year={2026},
  url={https://github.com/Jordine/introspective-model}
}
```

## Acknowledgments

- [vgel](https://vgel.me/) for the original introspection finding and open-source code
- Built during the [Constellation](https://constellation.org/) fellowship in Berkeley
adapter_config.json
ADDED
{
  "alora_invocation_tokens": null,
  "alpha_pattern": {},
  "arrow_config": null,
  "auto_mapping": null,
  "base_model_name_or_path": "Qwen/Qwen2.5-Coder-32B-Instruct",
  "bias": "none",
  "corda_config": null,
  "ensure_weight_tying": false,
  "eva_config": null,
  "exclude_modules": null,
  "fan_in_fan_out": false,
  "inference_mode": true,
  "init_lora_weights": true,
  "layer_replication": null,
  "layers_pattern": null,
  "layers_to_transform": null,
  "loftq_config": {},
  "lora_alpha": 32,
  "lora_bias": false,
  "lora_dropout": 0.05,
  "megatron_config": null,
  "megatron_core": "megatron.core",
  "modules_to_save": null,
  "peft_type": "LORA",
  "peft_version": "0.18.1",
  "qalora_group_size": 16,
  "r": 16,
  "rank_pattern": {},
  "revision": null,
  "target_modules": [
    "q_proj",
    "o_proj",
    "v_proj",
    "k_proj"
  ],
  "target_parameters": null,
  "task_type": "CAUSAL_LM",
  "trainable_token_indices": null,
  "use_dora": false,
  "use_qalora": false,
  "use_rslora": false
}
adapter_model.safetensors
ADDED

version https://git-lfs.github.com/spec/v1
oid sha256:03e60ea43d373639d07b0a046433df1c45b3619f15473faba976f03eee39feac
size 134286984