Qwen3-8B Layer 36 TopK SAE

This repository contains a TopK sparse autoencoder (SAE) trained on Qwen3-8B residual-stream activations at layer 36, together with a complete reverse feature index for exploring learned features.

Release Summary

  • Repository: sammyliu/qwen3-8b-sae-l36-topk64
  • Base model: Qwen/Qwen3-8B
  • Base model revision: b968826
  • SAE hookpoint: layers.35 (zero-indexed module name for the 36th decoder layer)
  • Layer label: l36 (one-indexed label for the same layer)
  • SAE type: TopK SAE with AuxK dead-feature recovery
  • Input width: 4096
  • SAE width: 65536
  • TopK sparsity: k=64
  • Published checkpoint: step-070000
  • Training tokens seen by the published checkpoint: 2,293,760,000
  • Export formats: safetensors, PyTorch .pt, ONNX, and raw tensor files
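
The token count above is consistent with the published step and the batch size listed under Training Hyperparameters, assuming each training step consumes one batch of 32768 token activations. The short check below also estimates the parameter count from the listed widths, assuming untied encoder and decoder matrices with one bias vector each (which the W_enc, W_dec, b_enc, b_dec files suggest but this card does not state outright).

```python
# Sanity-check the release summary numbers.
# Assumption: one training step consumes one batch of token activations.
step = 70_000
batch_size = 32_768
tokens_seen = step * batch_size

d_in = 4096
d_sae = 65_536
# Assumption: untied weights W_enc (d_in x d_sae) and W_dec (d_sae x d_in),
# plus biases b_enc (d_sae) and b_dec (d_in).
n_params = 2 * d_in * d_sae + d_sae + d_in

print(f"tokens seen: {tokens_seen:,}")
print(f"parameters:  {n_params:,}")
```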

Files

Core SAE artifacts:

  • model.safetensors: primary checkpoint for loading the SAE weights and metadata.
  • model.metadata.json: architecture, normalization, model, and training metadata.
  • model.pt: PyTorch checkpoint export.
  • model.onnx: ONNX export.
  • W_enc, W_dec, b_enc, b_dec: raw tensors for lightweight custom loaders.
  • release_manifest.json: release inventory and provenance summary.

Reverse feature index artifacts:

  • reverse-index/l36-complete-merged/summary.json: run summary and provenance.
  • reverse-index/l36-complete-merged/feature_metrics.parquet: one row per SAE feature.
  • reverse-index/l36-complete-merged/feature_prompts.parquet: top prompt examples for active features.
  • reverse-index/l36-complete-merged/report.md: Markdown summary of the reverse index.

The reverse index was built over 978,471 LMSYS conversations and 280,280,758 activation records. It found 29,568 active features and stores 145,512 top prompt rows.

Loading Artifacts

from huggingface_hub import hf_hub_download

repo_id = "sammyliu/qwen3-8b-sae-l36-topk64"

sae_path = hf_hub_download(repo_id, "model.safetensors")
metadata_path = hf_hub_download(repo_id, "model.metadata.json")
feature_metrics_path = hf_hub_download(
    repo_id,
    "reverse-index/l36-complete-merged/feature_metrics.parquet",
)
feature_prompts_path = hf_hub_download(
    repo_id,
    "reverse-index/l36-complete-merged/feature_prompts.parquet",
)

The project SDK used to train, export, evaluate, index, and steer with these SAEs is developed in samliu/qwen-3-8b-interpretability.

Architecture And Rationale

This release uses a TopK SAE rather than JumpReLU or batch-topk.

TopK was chosen because it enforces exact-k sparsity directly: only the k=64 largest latent pre-activations are kept for each token. The sparsity target is therefore set explicitly, rather than tuned indirectly through an L1 coefficient or JumpReLU's L0 penalty coefficient.
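
To make the selection step concrete, here is a minimal NumPy sketch of a TopK SAE forward pass. The tensor shapes, the subtraction of b_dec before encoding, and the ReLU applied to the selected values are assumptions borrowed from common TopK SAE formulations, not a description of this project's trainer. The toy dimensions are chosen only for readability.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Encode x, keep the k largest pre-activations per row, then decode.

    Assumed shapes: x (batch, d_in), W_enc (d_in, d_sae), b_enc (d_sae,),
    W_dec (d_sae, d_in), b_dec (d_in,).
    """
    pre = (x - b_dec) @ W_enc + b_enc
    # Indices of the k largest pre-activations in each row (unordered).
    top_idx = np.argpartition(pre, -k, axis=-1)[:, -k:]
    latents = np.zeros_like(pre)
    rows = np.arange(pre.shape[0])[:, None]
    # ReLU on the selected values; every other latent stays at zero.
    latents[rows, top_idx] = np.maximum(pre[rows, top_idx], 0.0)
    recon = latents @ W_dec + b_dec
    return latents, recon

rng = np.random.default_rng(0)
d_in, d_sae, k = 16, 128, 4  # toy sizes; the release uses 4096, 65536, 64
x = rng.normal(size=(8, d_in))
W_enc = rng.normal(size=(d_in, d_sae)) / np.sqrt(d_in)
W_dec = W_enc.T.copy()
b_enc = np.zeros(d_sae)
b_dec = np.zeros(d_in)

latents, recon = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
active_per_row = (latents != 0).sum(axis=-1)
print(active_per_row)
```

With the ReLU variant sketched here, each row has at most k nonzero latents, and exactly k whenever the k selected pre-activations are all positive.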

AuxK was used with TopK for dead-feature recovery. This gave a simple recovery mechanism without coupling dead-feature handling to the sparsity target during the first long multi-GPU Qwen3 SAE training run.
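
As a rough illustration of the idea, the sketch below computes an AuxK-style auxiliary loss in NumPy: the features currently flagged as dead are asked to reconstruct whatever the main TopK pass left unexplained. The function name, the dead-feature mask, and the plain MSE normalization are assumptions for illustration; the trainer's exact formulation may differ. The coefficient 0.03125 is the AuxK coefficient listed under Training Hyperparameters.

```python
import numpy as np

def auxk_loss(x, recon, pre, W_dec, dead_mask, k_aux):
    """Auxiliary loss: reconstruct the main residual using only dead features.

    x, recon: (batch, d_in); pre: (batch, d_sae) encoder pre-activations;
    W_dec: (d_sae, d_in); dead_mask: (d_sae,) bool, True for dead features.
    """
    n_dead = int(dead_mask.sum())
    if n_dead == 0:
        return 0.0
    k_aux = min(k_aux, n_dead)
    residual = x - recon
    # Mask out live features so only dead ones can be selected.
    masked = np.where(dead_mask[None, :], pre, -np.inf)
    idx = np.argpartition(masked, -k_aux, axis=-1)[:, -k_aux:]
    rows = np.arange(pre.shape[0])[:, None]
    aux_latents = np.zeros_like(pre)
    aux_latents[rows, idx] = np.maximum(masked[rows, idx], 0.0)
    residual_hat = aux_latents @ W_dec
    return float(np.mean((residual - residual_hat) ** 2))

rng = np.random.default_rng(0)
batch, d_in, d_sae = 4, 8, 32  # toy sizes
x = rng.normal(size=(batch, d_in))
recon = np.zeros_like(x)  # stand-in for the main TopK reconstruction
pre = rng.normal(size=(batch, d_sae))
W_dec = rng.normal(size=(d_sae, d_in))
dead_mask = np.zeros(d_sae, dtype=bool)
dead_mask[:10] = True  # pretend the first 10 features are dead

main_mse = float(np.mean((x - recon) ** 2))
aux = auxk_loss(x, recon, pre, W_dec, dead_mask, k_aux=4)
total = main_mse + 0.03125 * aux  # main loss plus weighted auxiliary term
```

Because the auxiliary term only touches dead features, it can revive them without changing how many features the main pass activates.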

Batch-topk was intentionally deferred for this release. The project prioritized a simpler first full Qwen3 SAE pass with easier diagnostics, easier checkpoint interpretation, and fewer interacting sparsity mechanisms.

Training Data Provenance

  • Raw conversation dataset: lmsys/lmsys-chat-1m
  • Activation dataset: sammyliu/qwen3-8b-activations-l20-l36
  • Base model: Qwen/Qwen3-8B
  • Base model revision: b968826
  • Attention implementation: sdpa
  • WandB run: https://wandb.ai/samliu/qwen3-sae-l36/runs/ee01t73n
  • Export timestamp in metadata: 2026-04-11T20:51:55Z

The exported model.metadata.json preserves the exact normalizer statistics used by the checkpoint, including full input_mean and input_std vectors.
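
A custom loader needs to apply these statistics before encoding and undo them after decoding. The sketch below assumes plain per-dimension standardization and assumes that input_mean and input_std sit at the top level of the JSON object. Both are guesses to verify against the real file, and the trainer may also add a small epsilon to the standard deviation.

```python
import json
import numpy as np

def normalizer_from_metadata(meta):
    """Extract (mean, std) vectors from parsed metadata.

    Assumption: both keys are top-level; adjust if they are nested.
    """
    mean = np.asarray(meta["input_mean"], dtype=np.float32)
    std = np.asarray(meta["input_std"], dtype=np.float32)
    return mean, std

def normalize(x, mean, std):
    # Assumed form: per-dimension standardization.
    return (x - mean) / std

def denormalize(x_norm, mean, std):
    return x_norm * std + mean

# With the real file: meta = json.load(open(metadata_path))
# Toy 3-dimensional stand-in (the real vectors have 4096 entries):
meta = json.loads('{"input_mean": [0.5, -1.0, 2.0], "input_std": [1.0, 2.0, 0.5]}')
mean, std = normalizer_from_metadata(meta)

x = np.array([[1.5, 1.0, 2.5]], dtype=np.float32)
x_norm = normalize(x, mean, std)
x_back = denormalize(x_norm, mean, std)
```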

Training Hyperparameters

  • Live run name: 20260411-dual-a100-hf-fix22
  • Optimizer: Adam (beta1=0.9, beta2=0.999)
  • Learning rate: 0.0003
  • Batch size: 32768
  • AuxK coefficient: 0.03125
  • AuxK features: 512
  • Warmup tokens: 1,000,000
  • Prefetch shards: 4
  • Decoded prefetch shards: 8
  • Shuffle buffer: 8192
  • CPU prefetch batches: 16
  • Autocast dtype: bfloat16
  • CUDA allocator config: expandable_segments:True
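
For a sense of scale, 1,000,000 warmup tokens at a batch size of 32768 covers only about 31 steps. The sketch below assumes a linear ramp from zero followed by a constant learning rate. The warmup shape and the absence of later decay are assumptions, since this card lists only the warmup length and the base rate.

```python
def warmup_lr(tokens_seen, base_lr=3e-4, warmup_tokens=1_000_000):
    """Linear warmup from 0 to base_lr over warmup_tokens, then constant.

    Assumption: linear ramp with no decay afterwards.
    """
    if tokens_seen >= warmup_tokens:
        return base_lr
    return base_lr * tokens_seen / warmup_tokens

batch_size = 32_768
warmup_steps = -(-1_000_000 // batch_size)  # ceiling division
lr_at_step_10 = warmup_lr(10 * batch_size)
lr_at_step_100 = warmup_lr(100 * batch_size)
```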

Checkpoint Policy

The trainer was configured to keep the 3 best checkpoints plus the final checkpoint locally, exporting each retained checkpoint as safetensors, .pt, and ONNX.

  • save_every: 5000
  • best_checkpoint_count: 3
  • best_checkpoint_metric: loss
  • best_checkpoint_mode: min
  • export_formats: ["safetensors", "pt", "onnx"]
  • Repo-tracked retained steps: 30000, 50000, 65000
  • Rescue-volume retained steps: 30000, 50000, 70000

The repo-tracked artifact manifest for l36 stopped at step-065000, but the rescue-volume static mirror preserved a later retained export at step-070000. This published repo uses that preserved step-070000 export.
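
The retention rule described above (best_checkpoint_count lowest-loss checkpoints plus the most recent one) can be sketched as follows. The function name and the loss values are hypothetical and serve only to illustrate the policy; they are not taken from the training run.

```python
def retained_steps(history, best_count=3):
    """Return the steps to keep: the best_count lowest-loss saves plus the latest.

    history: list of (step, loss) tuples in save order.
    """
    if not history:
        return []
    best = sorted(history, key=lambda item: item[1])[:best_count]
    keep = {step for step, _ in best}
    keep.add(history[-1][0])  # always keep the most recent save
    return sorted(keep)

# Hypothetical losses, invented for illustration only.
history = [(5000, 0.40), (10000, 0.36), (15000, 0.37), (20000, 0.33), (25000, 0.34)]
kept = retained_steps(history)

# A later, worse checkpoint is still kept because it is the latest.
kept_later = retained_steps(history + [(30000, 0.50)])
```

Under a rule like this, the set of retained steps can differ between storage locations if they were last synced at different points in the run.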

Reverse Feature Index

The reverse index maps SAE features to high-activation prompt examples from the activation dataset. It is intended for feature discovery, interpretation, and steering-feature selection.

Indexing setup:

  • SAE checkpoint: this repository's model.safetensors
  • Activation source: sammyliu/qwen3-8b-activations-l20-l36
  • Raw prompt source: lmsys/lmsys-chat-1m
  • Layer: l36
  • Hookpoint: layers.35
  • Batch size: 32768
  • Context window around firing token: 8 tokens
  • Stored prompt examples per feature: top 5
  • Conversations indexed: 978,471
  • Activation records indexed: 280,280,758
  • Active features found: 29,568

feature_metrics.parquet contains per-feature firing and activation metrics. feature_prompts.parquet contains top activation examples with token and conversation context. The examples are not labels; they are evidence for human or LLM-assisted interpretation.
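
To show how an index of this kind can be assembled, the sketch below keeps the top-N activation records per feature with a bounded min-heap and attaches a token window around each firing position. The record layout, the function name, and the reading of "8 tokens" as 8 on each side are assumptions. The actual parquet schema should be inspected directly rather than inferred from this example.

```python
import heapq
from collections import defaultdict

def build_top_examples(records, tokens_by_conv, top_n=5, window=8):
    """Keep the top_n highest-activation records per feature, with context.

    records: iterable of (feature_id, activation, conv_id, token_pos).
    tokens_by_conv: dict mapping conv_id to its list of tokens.
    """
    heaps = defaultdict(list)
    for feature_id, activation, conv_id, token_pos in records:
        entry = (activation, conv_id, token_pos)
        heap = heaps[feature_id]
        if len(heap) < top_n:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:
            # Replace the weakest kept example with the stronger one.
            heapq.heapreplace(heap, entry)
    index = {}
    for feature_id, heap in heaps.items():
        rows = []
        for activation, conv_id, token_pos in sorted(heap, reverse=True):
            toks = tokens_by_conv[conv_id]
            lo = max(0, token_pos - window)
            hi = min(len(toks), token_pos + window + 1)
            rows.append({
                "activation": activation,
                "conversation": conv_id,
                "token": toks[token_pos],
                "context": toks[lo:hi],
            })
        index[feature_id] = rows
    return index

# Toy data, invented for illustration.
tokens_by_conv = {"c1": ["The", "cat", "sat"], "c2": ["A", "dog", "ran", "fast"]}
records = [(7, 0.9, "c1", 1), (7, 0.2, "c2", 0), (7, 1.5, "c2", 2), (3, 0.4, "c1", 2)]
index = build_top_examples(records, tokens_by_conv, top_n=2, window=1)
```

A bounded heap keeps memory proportional to the number of features times top_n, which matters when scanning hundreds of millions of activation records.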

Evaluation Snapshots

Two early rescued checkpoints were smoke-evaluated to give a rough quality baseline:

  • step-005000: intrinsic fve=0.694073, mse=0.323483, dead=38315, records=131366; core-plus ce_score=0.765003, kl_score=0.948459, tokens=106
  • step-010000: intrinsic fve=0.701040, mse=0.316117, dead=37915, records=131366; core-plus ce_score=0.771135, kl_score=0.949418, tokens=106

The published step-070000 checkpoint was selected from retained training exports, not from the early smoke-eval checkpoints.
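
For readers unfamiliar with the intrinsic metrics, FVE (fraction of variance explained) is commonly defined as 1 - MSE / Var(x). The sketch below uses that definition, which is an assumption about how this project computes it. Under the same assumption, the two reported (fve, mse) pairs can be checked for mutual consistency by solving for the implied input variance.

```python
import numpy as np

def fraction_variance_explained(x, recon):
    """FVE = 1 - MSE / Var, with variance taken around the per-dimension mean."""
    mse = np.mean((x - recon) ** 2)
    var = np.mean((x - x.mean(axis=0, keepdims=True)) ** 2)
    return 1.0 - mse / var

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 16))
# A perfect reconstruction explains all variance.
perfect = fraction_variance_explained(x, x)
# Predicting the mean for every input explains none.
mean_only = fraction_variance_explained(x, np.broadcast_to(x.mean(axis=0), x.shape))

def implied_variance(fve, mse):
    """Input variance implied by a reported (fve, mse) pair under FVE = 1 - MSE/Var."""
    return mse / (1.0 - fve)

v_5k = implied_variance(0.694073, 0.323483)   # step-005000
v_10k = implied_variance(0.701040, 0.316117)  # step-010000
print(v_5k, v_10k)
```

If the two implied variances agree closely, the snapshots were most likely scored on the same evaluation records, which matches the identical records=131366 count reported for both.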

Reproduction Notes

To reproduce this SAE as closely as possible, match:

  • Base model Qwen/Qwen3-8B at revision b968826
  • Hookpoint layers.35
  • Raw text source lmsys/lmsys-chat-1m
  • Activation source sammyliu/qwen3-8b-activations-l20-l36
  • TopK SAE with d_sae=65536 and k=64
  • Hyperparameters and checkpoint policy listed above
  • Saved normalizer statistics in model.metadata.json
  • Retained-checkpoint selection logic described in this model card

The original live run was halted to preserve research artifacts before the GPU reservation ended. The last training state recorded for this layer before the halt was step 74670, with FVE 0.7188 and 36,645 dead features. The published checkpoint is earlier than the halted step because only retained exports were guaranteed to survive the rescue.

Limitations

  • This is a research artifact, not a production safety system.
  • The reverse index provides high-activation examples, not ground-truth feature labels.
  • Steering reliability must be evaluated per feature before using a feature in a demo or intervention.
  • The published checkpoint is the latest retained rescued export, not necessarily the final training step reached before shutdown.