Qwen3-8B Layer 36 TopK SAE
This repository contains a TopK sparse autoencoder (SAE) trained on Qwen3-8B residual-stream activations at layer 36, together with a complete reverse feature index for exploring learned features.
Release Summary
- Repository: sammyliu/qwen3-8b-sae-l36-topk64
- Base model: Qwen/Qwen3-8B
- Base model revision: b968826
- SAE hookpoint: layers.35
- Layer label: l36
- SAE type: TopK SAE with AuxK dead-feature recovery
- Input width: 4096
- SAE width: 65536
- TopK sparsity: k=64
- Published checkpoint: step-070000
- Training tokens seen by the published checkpoint: 2,293,760,000
- Export formats: safetensors, PyTorch .pt, ONNX, and raw tensor files
Files
Core SAE artifacts:
- model.safetensors: primary checkpoint for loading the SAE weights and metadata.
- model.metadata.json: architecture, normalization, model, and training metadata.
- model.pt: PyTorch checkpoint export.
- model.onnx: ONNX export.
- W_enc, W_dec, b_enc, b_dec: raw tensors for lightweight custom loaders.
- release_manifest.json: release inventory and provenance summary.
Reverse feature index artifacts:
- reverse-index/l36-complete-merged/summary.json: run summary and provenance.
- reverse-index/l36-complete-merged/feature_metrics.parquet: one row per SAE feature.
- reverse-index/l36-complete-merged/feature_prompts.parquet: top prompt examples for active features.
- reverse-index/l36-complete-merged/report.md: Markdown summary of the reverse index.
The reverse index was built over 978,471 LMSYS conversations and
280,280,758 activation records. It found 29,568 active features and stores
145,512 top prompt rows.
Loading Artifacts
from huggingface_hub import hf_hub_download

repo_id = "sammyliu/qwen3-8b-sae-l36-topk64"

sae_path = hf_hub_download(repo_id, "model.safetensors")
metadata_path = hf_hub_download(repo_id, "model.metadata.json")
feature_metrics_path = hf_hub_download(
    repo_id,
    "reverse-index/l36-complete-merged/feature_metrics.parquet",
)
feature_prompts_path = hf_hub_download(
    repo_id,
    "reverse-index/l36-complete-merged/feature_prompts.parquet",
)
The project SDK used to train, export, evaluate, index, and steer with these
SAEs is developed in samliu/qwen-3-8b-interpretability.
Architecture And Rationale
This release uses a TopK SAE rather than JumpReLU or batch-topk.
TopK was chosen because it enforces exact-k sparsity directly. The sparsity
target is explicit, rather than mediated through an L1 coefficient or
JumpReLU's L0 penalty coefficient.
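As a rough illustration of the exact-k mechanism, the NumPy sketch below keeps the k largest encoder pre-activations for each input and zeroes the rest. The weight names mirror the raw tensor files (W_enc, W_dec, b_enc, b_dec), but the tensor layout, the ReLU on the selected values, the subtraction of b_dec before encoding, and the toy dimensions are all assumptions made for illustration. It is not the project SDK's implementation.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Sketch of a TopK SAE forward pass (assumed layout: W_enc is (d_in, d_sae))."""
    pre = (x - b_dec) @ W_enc + b_enc            # encoder pre-activations, (batch, d_sae)
    # Indices of the k largest pre-activations in each row (unordered).
    top_idx = np.argpartition(pre, -k, axis=-1)[:, -k:]
    top_vals = np.take_along_axis(pre, top_idx, axis=-1)
    acts = np.zeros_like(pre)
    # Keep only the selected values (clamped at zero); everything else stays 0.
    np.put_along_axis(acts, top_idx, np.maximum(top_vals, 0.0), axis=-1)
    recon = acts @ W_dec + b_dec                 # decode back to the input width
    return acts, recon

# Toy dimensions; the released SAE uses d_in=4096, d_sae=65536, k=64.
rng = np.random.default_rng(0)
d_in, d_sae, k, batch = 16, 64, 4, 8
W_enc = rng.normal(size=(d_in, d_sae)) / np.sqrt(d_in)
W_dec = W_enc.T.copy()
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_in)
x = rng.normal(size=(batch, d_in))

acts, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
active_per_row = (acts != 0).sum(axis=-1)
print(active_per_row)  # never more than k nonzero features per input
```

Because the selection happens per input, the number of active features is bounded by k regardless of any loss coefficient.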
AuxK was used with TopK for dead-feature recovery. This gave a simple recovery mechanism without coupling dead-feature handling to the sparsity target during the first long multi-GPU Qwen3 SAE training run.
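The sketch below shows one common way an AuxK term is formulated: features flagged as dead are asked to reconstruct the residual left by the main TopK pass, and that error is added to the main loss with a small coefficient. The coefficient 0.03125 and the 512 AuxK features come from the Training Hyperparameters section of this card. How the trainer flags dead features and whether it normalizes the auxiliary error are not stated here, so the `dead_mask` input and the plain mean-squared error are assumptions.

```python
import numpy as np

def auxk_loss(x, recon, pre, dead_mask, W_dec, k_aux):
    """Sketch of an AuxK auxiliary loss computed over features flagged as dead."""
    residual = x - recon                          # error left by the main TopK pass
    k_aux = min(k_aux, int(dead_mask.sum()))
    if k_aux == 0:
        return 0.0                                # no dead features, no auxiliary term
    # Mask live features so only dead ones can be selected.
    dead_pre = np.where(dead_mask, pre, -np.inf)
    idx = np.argpartition(dead_pre, -k_aux, axis=-1)[:, -k_aux:]
    vals = np.maximum(np.take_along_axis(dead_pre, idx, axis=-1), 0.0)
    aux_acts = np.zeros_like(pre)
    np.put_along_axis(aux_acts, idx, vals, axis=-1)
    residual_hat = aux_acts @ W_dec               # dead features try to explain the residual
    return float(np.mean((residual - residual_hat) ** 2))

# Toy inputs; every value below is a stand-in, not data from the run.
rng = np.random.default_rng(0)
batch, d_in, d_sae = 8, 16, 64
x = rng.normal(size=(batch, d_in))
recon = 0.5 * x                                   # stand-in for the main reconstruction
pre = rng.normal(size=(batch, d_sae))
W_dec = rng.normal(size=(d_sae, d_in)) / np.sqrt(d_sae)
dead_mask = np.zeros(d_sae, dtype=bool)
dead_mask[40:] = True                             # hypothetical set of dead features

main_mse = float(np.mean((x - recon) ** 2))
aux = auxk_loss(x, recon, pre, dead_mask, W_dec, k_aux=8)
total = main_mse + 0.03125 * aux                  # AuxK coefficient from this card
print(main_mse, aux, total)
```

The auxiliary term selects its own features, so it does not change how many features the main TopK pass activates.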
Batch-topk was intentionally deferred for this release. The project prioritized a simpler first full Qwen3 SAE pass with easier diagnostics, easier checkpoint interpretation, and fewer interacting sparsity mechanisms.
Training Data Provenance
- Raw conversation dataset: lmsys/lmsys-chat-1m
- Activation dataset: sammyliu/qwen3-8b-activations-l20-l36
- Base model: Qwen/Qwen3-8B
- Base model revision: b968826
- Attention implementation: sdpa
- WandB run: https://wandb.ai/samliu/qwen3-sae-l36/runs/ee01t73n
- Export timestamp in metadata: 2026-04-11T20:51:55Z
The exported model.metadata.json preserves the exact normalizer statistics
used by the checkpoint, including full input_mean and input_std vectors.
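A minimal sketch of how per-dimension statistics like these are typically applied, assuming elementwise standardization before the encoder and the inverse transform after the decoder. The function names, the `eps` guard, and the toy vectors are hypothetical. The actual formula and the layout of the vectors inside model.metadata.json should be checked against the project SDK.

```python
import numpy as np

def normalize(x, input_mean, input_std, eps=1e-6):
    """Standardize activations per dimension (assumed convention)."""
    return (x - input_mean) / (input_std + eps)

def denormalize(z, input_mean, input_std, eps=1e-6):
    """Invert normalize() to return to the model's activation scale."""
    return z * (input_std + eps) + input_mean

# Toy statistics; the released vectors have length 4096.
rng = np.random.default_rng(0)
input_mean = rng.normal(size=16)
input_std = rng.uniform(0.5, 2.0, size=16)
x = rng.normal(size=(4, 16)) * input_std + input_mean

z = normalize(x, input_mean, input_std)
x_back = denormalize(z, input_mean, input_std)
print(np.allclose(x, x_back))
```

Using the saved statistics rather than recomputing them matters because the SAE weights were fit to inputs on that exact scale.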
Training Hyperparameters
- Live run name: 20260411-dual-a100-hf-fix22
- Optimizer: Adam(beta1=0.9, beta2=0.999)
- Learning rate: 0.0003
- Batch size: 32768
- AuxK coefficient: 0.03125
- AuxK features: 512
- Warmup tokens: 1,000,000
- Prefetch shards: 4
- Decoded prefetch shards: 8
- Shuffle buffer: 8192
- CPU prefetch batches: 16
- Autocast dtype: bfloat16
- CUDA allocator config: expandable_segments:True
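The token count reported for the published checkpoint can be cross-checked against these hyperparameters. The sketch assumes the batch size is counted in tokens (one activation vector per token), which this card does not state explicitly. Under that assumption the product matches the reported figure exactly.

```python
batch_size = 32_768        # "Batch size" from the list above (assumed to be in tokens)
published_step = 70_000    # step-070000
warmup_tokens = 1_000_000  # "Warmup tokens" from the list above

tokens_seen = published_step * batch_size
warmup_steps = warmup_tokens / batch_size  # fractional steps covered by warmup

print(f"{tokens_seen:,}")  # 2,293,760,000
print(warmup_steps)
```

Under the same assumption, warmup covers only the first few dozen optimizer steps.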
Checkpoint Policy
The trainer was configured to keep the 3 best checkpoints plus the final
checkpoint locally, exporting each retained checkpoint as safetensors, .pt,
and ONNX.
- save_every: 5000
- best_checkpoint_count: 3
- best_checkpoint_metric: loss
- best_checkpoint_mode: min
- export_formats: ["safetensors", "pt", "onnx"]
- Repo-tracked retained steps: 30000, 50000, 65000
- Rescue-volume retained steps: 30000, 50000, 70000
The repo-tracked artifact manifest for l36 stopped at step-065000, but the
rescue-volume static mirror preserved a later retained export at step-070000.
This published repo uses that preserved step-070000 export.
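The retention rule described in this section (keep the best 3 checkpoints by minimum loss, plus the most recent one) can be sketched as follows. The function name and the loss values are hypothetical. Only the 5000-step spacing mirrors save_every, and the trainer's real selection code may differ.

```python
def retained_steps(history, best_count=3, mode="min"):
    """history: list of (step, metric) pairs for saved checkpoints, in step order."""
    if not history:
        return []
    ranked = sorted(history, key=lambda sm: sm[1], reverse=(mode == "max"))
    keep = {step for step, _ in ranked[:best_count]}
    keep.add(history[-1][0])  # always keep the most recent checkpoint
    return sorted(keep)

# Hypothetical losses, saved every 5000 steps.
history = [
    (5000, 0.40),
    (10000, 0.36),
    (15000, 0.35),
    (20000, 0.33),
    (25000, 0.34),
    (30000, 0.37),
]
print(retained_steps(history))  # [15000, 20000, 25000, 30000]
```

Under a rule like this, a retained set is a snapshot in time. A later save can evict an earlier best checkpoint, so two mirrors captured at different moments can hold different steps.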
Reverse Feature Index
The reverse index maps SAE features to high-activation prompt examples from the activation dataset. It is intended for feature discovery, interpretation, and steering-feature selection.
Indexing setup:
- SAE checkpoint: this repository's model.safetensors
- Activation source: sammyliu/qwen3-8b-activations-l20-l36
- Raw prompt source: lmsys/lmsys-chat-1m
- Layer: l36
- Hookpoint: layers.35
- Batch size: 32768
- Context window around firing token: 8 tokens
- Stored prompt examples per feature: top 5
- Conversations indexed: 978,471
- Activation records indexed: 280,280,758
- Active features found: 29,568
feature_metrics.parquet contains per-feature firing and activation metrics.
feature_prompts.parquet contains top activation examples with token and
conversation context. The examples are not labels; they are evidence for human
or LLM-assisted interpretation.
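To make the indexing idea concrete, the sketch below keeps the strongest activation records per feature with a bounded min-heap and attaches a token window around each firing position. The record layout, the function name, and the reading of "8 tokens" as 8 tokens on each side are assumptions. The released parquet schema is not described in this card and may differ.

```python
import heapq
from collections import defaultdict

def build_reverse_index(records, tokens_by_conv, top_n=5, window=8):
    """records: iterable of (feature_id, activation, conv_id, token_pos) tuples."""
    heaps = defaultdict(list)  # feature_id -> min-heap of its strongest records
    for feature_id, activation, conv_id, pos in records:
        item = (activation, conv_id, pos)
        heap = heaps[feature_id]
        if len(heap) < top_n:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)  # evict the weakest stored record
    index = {}
    for feature_id, heap in heaps.items():
        rows = []
        for activation, conv_id, pos in sorted(heap, reverse=True):
            toks = tokens_by_conv[conv_id]
            lo, hi = max(0, pos - window), min(len(toks), pos + window + 1)
            rows.append({
                "activation": activation,
                "conversation": conv_id,
                "token": toks[pos],
                "context": toks[lo:hi],
            })
        index[feature_id] = rows
    return index

# Toy data: two tiny "conversations" and four activation records.
tokens_by_conv = {
    "c1": "the cat sat on the mat".split(),
    "c2": "a dog ran".split(),
}
records = [(7, 0.9, "c1", 1), (7, 0.2, "c2", 1), (7, 1.5, "c1", 5), (3, 0.4, "c2", 2)]
index = build_reverse_index(records, tokens_by_conv, top_n=2, window=1)
for row in index[7]:
    print(row["activation"], row["token"], row["context"])
```

A feature that fires on fewer than top_n records simply stores fewer rows, which is consistent with 145,512 stored rows for 29,568 active features (about 4.9 per feature rather than exactly 5).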
Evaluation Snapshots
Early rescued checkpoints were smoke-evaluated to provide a quality baseline:
- step-005000: intrinsic fve=0.694073, mse=0.323483, dead=38315, records=131366; core-plus ce_score=0.765003, kl_score=0.948459, tokens=106
- step-010000: intrinsic fve=0.701040, mse=0.316117, dead=37915, records=131366; core-plus ce_score=0.771135, kl_score=0.949418, tokens=106
The published step-070000 checkpoint was selected from retained training
exports, not from the early smoke-eval checkpoints.
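The intrinsic numbers above can be sanity-checked against each other. The sketch assumes the usual definition of fraction of variance explained, FVE = 1 - MSE / Var(x), which this card does not spell out. Under that assumption both snapshots imply nearly the same input variance (about 1.057), and more than half of the 65,536 features were counted as dead at these early steps.

```python
D_SAE = 65_536  # SAE width from the release summary

snapshots = {
    "step-005000": {"fve": 0.694073, "mse": 0.323483, "dead": 38_315},
    "step-010000": {"fve": 0.701040, "mse": 0.316117, "dead": 37_915},
}

def implied_variance(fve, mse):
    # Assumes FVE = 1 - MSE / Var(x); rearranged to recover Var(x).
    return mse / (1.0 - fve)

results = {}
for name, s in snapshots.items():
    var = implied_variance(s["fve"], s["mse"])
    dead_frac = s["dead"] / D_SAE
    results[name] = (var, dead_frac)
    print(f"{name}: implied variance={var:.4f}, dead fraction={dead_frac:.1%}")
```

Agreement between the two implied variances suggests the metrics were computed on the same normalized evaluation records.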
Reproduction Notes
To reproduce this SAE as closely as possible, match:
- Base model Qwen/Qwen3-8B at revision b968826
- Hookpoint layers.35
- Raw text source lmsys/lmsys-chat-1m
- Activation source sammyliu/qwen3-8b-activations-l20-l36
- TopK SAE with d_sae=65536 and k=64
- Hyperparameters and checkpoint policy listed above
- Saved normalizer statistics in model.metadata.json
- Retained-checkpoint selection logic described in this model card
The original live run was halted to preserve research artifacts before the GPU
reservation ended. The last halted training state for this layer recorded step
74670, FVE 0.7188, and 36,645 dead features. The published checkpoint is
earlier than the halted step because only retained exports were guaranteed to
survive the rescue.
Limitations
- This is a research artifact, not a production safety system.
- The reverse index provides high-activation examples, not ground-truth feature labels.
- Steering reliability must be evaluated per feature before using a feature in a demo or intervention.
- The published checkpoint is the latest retained rescued export, not necessarily the final training step reached before shutdown.