Qwen3-8B Layer 36 TopK SAE

This repository contains a TopK sparse autoencoder (SAE) trained on Qwen3-8B residual-stream activations at layer 36, together with a complete reverse feature index for exploring learned features.

Release Summary

Repository: sammyliu/qwen3-8b-sae-l36-topk64
Base model: Qwen/Qwen3-8B
Base model revision: b968826
SAE hookpoint: layers.35 (zero-indexed module name; corresponds to layer label l36)
Layer label: l36
SAE type: TopK SAE with AuxK dead-feature recovery
Input width: 4096
SAE width: 65536
TopK sparsity: k=64
Published checkpoint: step-070000
Training tokens seen by the published checkpoint: 2,293,760,000
Export formats: safetensors, PyTorch .pt, ONNX, and raw tensor files
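The token count above follows from the published step and the training batch size listed under Training Hyperparameters. A minimal sketch of that arithmetic, assuming the batch size counts activation vectors (tokens) per optimizer step and that the SAE holds only the four tensors named under Files with the conventional shapes:

```python
# Values taken from this model card.
PUBLISHED_STEP = 70_000
BATCH_SIZE = 32_768  # assumed: activation vectors (tokens) per optimizer step
D_IN = 4096
D_SAE = 65_536

tokens_seen = PUBLISHED_STEP * BATCH_SIZE
print(f"tokens seen: {tokens_seen:,}")

# Assumed shapes: W_enc (D_IN, D_SAE), W_dec (D_SAE, D_IN), b_enc (D_SAE,), b_dec (D_IN,).
param_count = 2 * D_IN * D_SAE + D_SAE + D_IN
print(f"parameters: {param_count:,}")
```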

Files

Core SAE artifacts:

model.safetensors: primary checkpoint for loading the SAE weights and metadata.
model.metadata.json: architecture, normalization, model, and training metadata.
model.pt: PyTorch checkpoint export.
model.onnx: ONNX export.
W_enc, W_dec, b_enc, b_dec: raw tensors for lightweight custom loaders.
release_manifest.json: release inventory and provenance summary.

Reverse feature index artifacts:

reverse-index/l36-complete-merged/summary.json: run summary and provenance.
reverse-index/l36-complete-merged/feature_metrics.parquet: one row per SAE feature.
reverse-index/l36-complete-merged/feature_prompts.parquet: top prompt examples for active features.
reverse-index/l36-complete-merged/report.md: Markdown summary of the reverse index.

The reverse index was built over 978,471 LMSYS conversations and 280,280,758 activation records. It found 29,568 active features (out of 65,536) and stores 145,512 top prompt rows.

Loading Artifacts

from huggingface_hub import hf_hub_download

repo_id = "sammyliu/qwen3-8b-sae-l36-topk64"

sae_path = hf_hub_download(repo_id, "model.safetensors")
metadata_path = hf_hub_download(repo_id, "model.metadata.json")
feature_metrics_path = hf_hub_download(
    repo_id,
    "reverse-index/l36-complete-merged/feature_metrics.parquet",
)
feature_prompts_path = hf_hub_download(
    repo_id,
    "reverse-index/l36-complete-merged/feature_prompts.parquet",
)
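Once the files are downloaded, they can be opened with standard tooling. The helpers below are a hedged sketch rather than the project SDK: the function names are hypothetical, the tensor key names inside model.safetensors and the column names in the parquet files are not documented here, and the heavier dependencies (safetensors, torch, pandas with a parquet engine) are imported lazily so that only the functions you call need them.

```python
import json
from pathlib import Path


def load_metadata(path):
    """Read model.metadata.json into a plain dict."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


def load_sae_tensors(path):
    """Load all tensors from the safetensors checkpoint as a dict of torch tensors."""
    from safetensors.torch import load_file  # needs `safetensors` and `torch`

    return load_file(path)


def load_feature_table(path):
    """Load one of the reverse-index parquet files as a pandas DataFrame."""
    import pandas as pd  # needs pandas plus a parquet engine such as pyarrow

    return pd.read_parquet(path)


# Example usage with the paths downloaded above:
# metadata = load_metadata(metadata_path)
# print(sorted(metadata))                      # inspect the top-level keys first
# tensors = load_sae_tensors(sae_path)
# print({name: tuple(t.shape) for name, t in tensors.items()})
# metrics = load_feature_table(feature_metrics_path)
# print(metrics.columns.tolist())
```

Inspecting keys, shapes, and columns first avoids guessing at a schema this card does not spell out.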

The project SDK used to train, export, evaluate, index, and steer with these SAEs is developed in samliu/qwen-3-8b-interpretability.

Architecture And Rationale

This release uses a TopK SAE rather than JumpReLU or batch-topk.

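TopK was chosen because it enforces exact-k sparsity directly: the encoder keeps exactly the k=64 largest feature activations per input and zeroes the rest. The sparsity target is therefore explicit, rather than mediated through an L1 coefficient or JumpReLU's L0 penalty coefficient.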
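To make the exact-k behavior concrete, here is a minimal NumPy sketch of a TopK SAE forward pass on toy sizes. It is not the project's implementation: the tensor shapes, the subtraction of b_dec before encoding, and the absence of a ReLU on the kept values are assumptions (some implementations also apply a ReLU after the TopK selection).

```python
import numpy as np


def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Encode x, keep the k largest pre-activations per row, and decode.

    Assumed shapes: x (n, d_in), W_enc (d_in, d_sae), b_enc (d_sae,),
    W_dec (d_sae, d_in), b_dec (d_in,).
    """
    pre = (x - b_dec) @ W_enc + b_enc
    # Column indices of the k largest pre-activations in each row (unordered).
    top_idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    rows = np.arange(pre.shape[0])[:, None]
    latents = np.zeros_like(pre)
    latents[rows, top_idx] = pre[rows, top_idx]
    recon = latents @ W_dec + b_dec
    return latents, recon


rng = np.random.default_rng(0)
d_in, d_sae, k = 8, 32, 4  # toy sizes; this release uses 4096, 65536, and 64
x = rng.normal(size=(5, d_in))
W_enc = rng.normal(size=(d_in, d_sae))
W_dec = rng.normal(size=(d_sae, d_in))
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_in)

latents, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
print(np.count_nonzero(latents, axis=1))  # number of active features per row
```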

AuxK was used with TopK for dead-feature recovery. This gave a simple recovery mechanism without coupling dead-feature handling to the sparsity target during the first long multi-GPU Qwen3 SAE training run.
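As an illustration of the idea, the sketch below computes an AuxK-style auxiliary loss: the features currently considered dead are asked to reconstruct whatever the main TopK reconstruction missed. This is a simplified reading, not the trainer's code. How a feature is declared dead, whether the loss is normalized, and the function name are all assumptions; the coefficient and feature count are the values listed under Training Hyperparameters.

```python
import numpy as np

AUXK_COEFFICIENT = 0.03125  # from this model card
AUXK_FEATURES = 512         # from this model card


def auxk_loss(x, recon, pre, dead_mask, W_dec, k_aux):
    """MSE between the main residual and a reconstruction built from dead features only.

    x, recon: (n, d_in); pre: (n, d_sae) encoder pre-activations;
    dead_mask: (d_sae,) bool, True where a feature is considered dead (assumed given).
    """
    n_dead = int(dead_mask.sum())
    if n_dead == 0:
        return 0.0
    residual = x - recon
    k = min(k_aux, n_dead)
    # Live features get -inf so they can never be selected.
    masked = np.where(dead_mask[None, :], pre, -np.inf)
    top_idx = np.argpartition(masked, -k, axis=1)[:, -k:]
    rows = np.arange(pre.shape[0])[:, None]
    aux_latents = np.zeros_like(pre)
    aux_latents[rows, top_idx] = masked[rows, top_idx]
    aux_recon = aux_latents @ W_dec
    return float(np.mean((residual - aux_recon) ** 2))


rng = np.random.default_rng(0)
n, d_in, d_sae = 4, 8, 32  # toy sizes
x = rng.normal(size=(n, d_in))
recon = 0.5 * x  # stand-in for the main TopK reconstruction
pre = rng.normal(size=(n, d_sae))
W_dec = rng.normal(size=(d_sae, d_in))
dead_mask = np.zeros(d_sae, dtype=bool)
dead_mask[:10] = True  # pretend the first 10 features are dead

main_mse = float(np.mean((x - recon) ** 2))
aux = auxk_loss(x, recon, pre, dead_mask, W_dec, k_aux=AUXK_FEATURES)
total_loss = main_mse + AUXK_COEFFICIENT * aux
print(total_loss)
```

Because the auxiliary term only touches dead features, it adds gradient signal for them without changing which k features the main path selects.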

Batch-topk was intentionally deferred for this release. The project prioritized a simpler first full Qwen3 SAE pass with easier diagnostics, easier checkpoint interpretation, and fewer interacting sparsity mechanisms.

Training Data Provenance

Raw conversation dataset: lmsys/lmsys-chat-1m
Activation dataset: sammyliu/qwen3-8b-activations-l20-l36
Base model: Qwen/Qwen3-8B
Base model revision: b968826
Attention implementation: sdpa
WandB run: https://wandb.ai/samliu/qwen3-sae-l36/runs/ee01t73n
Export timestamp in metadata: 2026-04-11T20:51:55Z

The exported model.metadata.json preserves the exact normalizer statistics used by the checkpoint, including full input_mean and input_std vectors.
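A sketch of how such statistics are typically applied, assuming per-dimension standardization before encoding and the inverse after decoding. The exact formula, the epsilon, and where input_mean and input_std sit inside the JSON are assumptions; check model.metadata.json before relying on this.

```python
import numpy as np


def normalize(x, input_mean, input_std, eps=1e-6):
    """Standardize activations per dimension before passing them to the SAE."""
    return (x - input_mean) / (input_std + eps)


def denormalize(x_norm, input_mean, input_std, eps=1e-6):
    """Map SAE reconstructions back to the model's activation scale."""
    return x_norm * (input_std + eps) + input_mean


# Toy two-dimensional statistics; the real vectors have 4096 entries.
input_mean = np.array([1.0, -2.0])
input_std = np.array([2.0, 0.5])
x = np.array([[3.0, -1.5]])

x_norm = normalize(x, input_mean, input_std)
x_back = denormalize(x_norm, input_mean, input_std)
print(x_norm, x_back)
```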

Training Hyperparameters

Live run name: 20260411-dual-a100-hf-fix22
Optimizer: Adam(beta1=0.9,beta2=0.999)
Learning rate: 0.0003
Batch size: 32768
AuxK coefficient: 0.03125
AuxK features: 512
Warmup tokens: 1,000,000
Prefetch shards: 4
Decoded prefetch shards: 8
Shuffle buffer: 8192
CPU prefetch batches: 16
Autocast dtype: bfloat16
CUDA allocator config: expandable_segments:True
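For orientation, the warmup budget can be converted into optimizer steps from the values above. The schedule shape is not stated in this card, so the linear ramp followed by a constant rate below is an assumption, as is the helper name.

```python
import math

LEARNING_RATE = 3e-4
BATCH_SIZE = 32_768
WARMUP_TOKENS = 1_000_000

# Number of optimizer steps needed to consume the warmup token budget.
warmup_steps = math.ceil(WARMUP_TOKENS / BATCH_SIZE)


def lr_at_step(step):
    """Assumed schedule: linear ramp from 0 to the peak rate, then constant."""
    tokens = step * BATCH_SIZE
    return LEARNING_RATE * min(1.0, tokens / WARMUP_TOKENS)


print(warmup_steps, lr_at_step(10), lr_at_step(warmup_steps))
```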

Checkpoint Policy

The trainer was configured to keep the 3 best checkpoints plus the final checkpoint locally, exporting each retained checkpoint as safetensors, .pt, and ONNX.

save_every: 5000
best_checkpoint_count: 3
best_checkpoint_metric: loss
best_checkpoint_mode: min
export_formats: ["safetensors", "pt", "onnx"]
Repo-tracked retained steps: 30000, 50000, 65000
Rescue-volume retained steps: 30000, 50000, 70000

The repo-tracked artifact manifest for l36 stopped at step-065000, but the rescue-volume static mirror preserved a later retained export at step-070000. This published repo uses that preserved step-070000 export.
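The retention rule described by the settings above (best 3 by minimum loss, plus the final checkpoint) can be sketched as follows. The function name is hypothetical and the loss values are invented purely to exercise the logic; they are not measurements from this run.

```python
def retained_steps(losses_by_step, best_count=3, final_step=None):
    """Return the steps kept under a best-N (lowest loss) plus final-checkpoint policy."""
    # Sort by loss, breaking ties by step so the result is deterministic.
    ranked = sorted(losses_by_step, key=lambda step: (losses_by_step[step], step))
    keep = set(ranked[:best_count])
    if final_step is not None:
        keep.add(final_step)
    return sorted(keep)


# Made-up losses at save_every=5000 intervals, for illustration only.
example_losses = {
    5000: 0.40,
    10000: 0.35,
    15000: 0.36,
    20000: 0.33,
    25000: 0.34,
    30000: 0.37,
}
kept = retained_steps(example_losses, best_count=3, final_step=30000)
print(kept)
```

Under a policy like this, a later step is not guaranteed to be retained unless it is either among the best by loss or the final step, which is why retained sets can differ between storage locations captured at different times.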

Reverse Feature Index

The reverse index maps SAE features to high-activation prompt examples from the activation dataset. It is intended for feature discovery, interpretation, and steering-feature selection.

Indexing setup:

SAE checkpoint: this repository's model.safetensors
Activation source: sammyliu/qwen3-8b-activations-l20-l36
Raw prompt source: lmsys/lmsys-chat-1m
Layer: l36
Hookpoint: layers.35
Batch size: 32768
Context window around firing token: 8 tokens
Stored prompt examples per feature: top 5
Conversations indexed: 978,471
Activation records indexed: 280,280,758
Active features found: 29,568

feature_metrics.parquet contains per-feature firing and activation metrics. feature_prompts.parquet contains top activation examples with token and conversation context. The examples are not labels; they are evidence for human or LLM-assisted interpretation.
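The core of a reverse index of this kind is a per-feature top-N selection with a token context window. The sketch below shows one way to do that with a bounded min-heap per feature. It is not the project's indexer: the record layout and field names are hypothetical, and the 8-token window is interpreted here as 8 tokens on each side of the firing token.

```python
import heapq
from collections import defaultdict

TOP_EXAMPLES = 5    # stored prompt examples per feature, from this card
CONTEXT_TOKENS = 8  # assumed: tokens kept on each side of the firing token


def build_reverse_index(records, tokens_by_conversation):
    """records: iterable of (feature_id, activation, conversation_id, token_pos)."""
    heaps = defaultdict(list)  # feature_id -> min-heap of (activation, conv_id, pos)
    for feature_id, activation, conv_id, pos in records:
        heap = heaps[feature_id]
        entry = (activation, conv_id, pos)
        if len(heap) < TOP_EXAMPLES:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            # Replace the weakest stored example with the stronger one.
            heapq.heapreplace(heap, entry)

    index = {}
    for feature_id, heap in heaps.items():
        rows = []
        for activation, conv_id, pos in sorted(heap, reverse=True):
            tokens = tokens_by_conversation[conv_id]
            lo = max(0, pos - CONTEXT_TOKENS)
            hi = min(len(tokens), pos + CONTEXT_TOKENS + 1)
            rows.append({
                "activation": activation,
                "conversation_id": conv_id,
                "token": tokens[pos],
                "context": tokens[lo:hi],
            })
        index[feature_id] = rows
    return index


# Tiny made-up example.
tokens_by_conversation = {
    "c1": ["the", "cat", "sat"],
    "c2": ["a", "dog", "ran", "home"],
}
records = [(7, 0.9, "c1", 1), (7, 2.5, "c2", 2), (3, 1.1, "c2", 0)]
index = build_reverse_index(records, tokens_by_conversation)
print(index[7][0])
```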

Evaluation Snapshots

Two early rescued checkpoints were smoke-evaluated to provide a rough quality baseline:

step-005000: intrinsic fve=0.694073, mse=0.323483, dead=38315, records=131366; core-plus ce_score=0.765003, kl_score=0.948459, tokens=106
step-010000: intrinsic fve=0.701040, mse=0.316117, dead=37915, records=131366; core-plus ce_score=0.771135, kl_score=0.949418, tokens=106

The published step-070000 checkpoint was selected from retained training exports, not from the early smoke-eval checkpoints.
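For readers unfamiliar with the intrinsic metrics, the sketch below shows common definitions of fraction of variance explained (fve) and a dead-feature count. The exact definitions and normalization used to produce the numbers above are not stated in this card, so treat these as illustrative assumptions rather than the evaluation code.

```python
import numpy as np


def fraction_of_variance_explained(x, recon):
    """1 - (squared reconstruction error) / (squared deviation from the per-dimension mean)."""
    residual = np.sum((x - recon) ** 2)
    total = np.sum((x - x.mean(axis=0)) ** 2)
    return float(1.0 - residual / total)


def count_dead_features(latents):
    """Count features that never fire on any of the evaluated records."""
    fired = np.any(latents != 0, axis=0)
    return int(np.sum(~fired))


# Toy data: two records, two input dimensions, three features.
x = np.array([[1.0, 2.0], [3.0, 6.0]])
mean_recon = np.tile(x.mean(axis=0), (2, 1))  # predicting the mean explains nothing
latents = np.array([[0.0, 1.0, 0.0], [0.0, 2.0, 0.0]])

print(fraction_of_variance_explained(x, x))
print(fraction_of_variance_explained(x, mean_recon))
print(count_dead_features(latents))
```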

Reproduction Notes

To reproduce this SAE as closely as possible, match:

Base model Qwen/Qwen3-8B at revision b968826
Hookpoint layers.35
Raw text source lmsys/lmsys-chat-1m
Activation source sammyliu/qwen3-8b-activations-l20-l36
TopK SAE with d_sae=65536 and k=64
Hyperparameters and checkpoint policy listed above
Saved normalizer statistics in model.metadata.json
Retained-checkpoint selection logic described in this model card

The original live run was halted to preserve research artifacts before the GPU reservation ended. The last halted training state for this layer recorded step 74670, FVE 0.7188, and 36,645 dead features. The published checkpoint is earlier than the halted step because only retained exports were guaranteed to survive the rescue.

Limitations

This is a research artifact, not a production safety system.
The reverse index provides high-activation examples, not ground-truth feature labels.
Steering reliability must be evaluated per feature before using a feature in a demo or intervention.
The published checkpoint is the latest retained rescued export, not necessarily the final training step reached before shutdown.
