GoldDiamondGold-Paperbliteration-L33-70b


This is a targeted abliteration of KaraKaraWitch/GoldDiamondGold-L33-70b.

Methodology

Previous abliteration attempts on this model resulted in regressions on the UGI Leaderboard. Specifically, the NatInt (Natural Intelligence), Textbook, and World Model scores were significantly reduced.

We suspect this degradation occurs because the "refusal" vectors in Llama-3.3 are heavily entangled with factual knowledge and reasoning capabilities located in the MLP layers. When the MLP is ablated to remove refusals, "Textbook" knowledge is lost as collateral damage.

This version ("Paperbliteration") uses a constrained optimization strategy, implemented in a customized build of Heretic, aimed at mitigating this issue:

  1. MLP Preservation: The optimization was constrained to effectively ignore MLP layers (down_proj weights < 0.05) to preserve knowledge and reasoning capabilities.
  2. Attention Targeting: Refusal removal was offloaded to the Attention layers (o_proj), with weights forced between 1.0 and 2.0.
  3. Winsorization: Applied at the 0.95 quantile to mitigate the impact of Llama-3's massive activation outliers on vector calculation.
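Heretic's internals are not reproduced here, but the winsorization step can be illustrated with a generic NumPy sketch: refusal directions are commonly computed as a difference of mean activations between harmful and harmless prompts, and clipping each activation dimension at the 0.95 quantile keeps Llama-3's massive outlier channels from dominating that mean. All function names below are illustrative, not Heretic's API.

```python
import numpy as np

def winsorize(acts: np.ndarray, q: float = 0.95) -> np.ndarray:
    """Clip each activation dimension to its q-quantile magnitude,
    taming massive-activation outliers before averaging."""
    hi = np.quantile(np.abs(acts), q, axis=0)
    return np.clip(acts, -hi, hi)

def refusal_direction(harmful_acts: np.ndarray,
                      harmless_acts: np.ndarray) -> np.ndarray:
    """Difference-of-means refusal vector on winsorized activations,
    normalized to unit length."""
    diff = winsorize(harmful_acts).mean(axis=0) - winsorize(harmless_acts).mean(axis=0)
    return diff / np.linalg.norm(diff)

# Toy example: 8 prompts x 4 hidden dims per class.
rng = np.random.default_rng(0)
harmful = rng.normal(1.0, 0.1, (8, 4))   # activations shifted along all dims
harmless = rng.normal(0.0, 0.1, (8, 4))
r = refusal_direction(harmful, harmless)
```

Without the clipping step, a single outlier channel (Llama-3 activations can spike by orders of magnitude) would dominate `diff` and skew the extracted direction.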

Heretic Parameters (Trial 164)

Parameter                          Value   Note
direction_index                    40.37   Mid-stack intervention
attn.o_proj.max_weight             1.99    High attention ablation
attn.o_proj.max_weight_position    50.92
attn.o_proj.min_weight             1.96
attn.o_proj.min_weight_distance    44.69
mlp.down_proj.max_weight           0.04    Knowledge preservation (near zero)
mlp.down_proj.max_weight_position  50.87
mlp.down_proj.min_weight           0.04
mlp.down_proj.min_weight_distance  26.10
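For intuition, the max/min weights above scale how strongly the refusal direction is projected out of each weight matrix. A generic NumPy sketch of directional ablation (this is the standard abliteration operation, not Heretic's exact code) shows why o_proj weights near 2.0 ablate aggressively while down_proj weights of 0.04 leave the matrix nearly untouched:

```python
import numpy as np

def ablate_matrix(W: np.ndarray, r: np.ndarray, weight: float) -> np.ndarray:
    """Scale down the refusal component of a matrix that writes into the
    residual stream: W' = (I - weight * r r^T) @ W.

    weight = 1.0 removes the refusal component exactly; weight near 2.0
    (as used for o_proj here) reflects it; weight near 0.0 (as used for
    down_proj here) is effectively a no-op, preserving MLP knowledge.
    """
    r = r / np.linalg.norm(r)
    return W - weight * np.outer(r, r) @ W

# Toy check on a random 4x4 matrix and direction.
rng = np.random.default_rng(1)
W = rng.normal(size=(4, 4))
r = rng.normal(size=4)
r_unit = r / np.linalg.norm(r)

W1 = ablate_matrix(W, r, weight=1.0)   # refusal component fully removed
W2 = ablate_matrix(W, r, weight=2.0)   # refusal component reflected
```

With `weight=1.0` the output of `W1` has zero component along `r`; with `weight=2.0` that component is exactly negated, which is why high o_proj weights suppress refusals so strongly.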

Reproducibility

Currently, constraints are not part of standard Heretic. You will need this PR.

Command Used:

heretic --model KaraKaraWitch/GoldDiamondGold-L33-70b \
 --orthogonalize-direction \
 --row-normalization FULL \
 --winsorization-quantile 0.95 \
 --constraints.layer-end-fraction 0.75 \
 --constraints.mlp.max-weight-min 0.0 \
 --constraints.mlp.max-weight-max 0.05 \
 --constraints.attention.max-weight-min 1.0 \
 --constraints.attention.max-weight-max 2.0 \
 --n-trials 200 \
 --batch-size 128 #  Not strictly needed

Evaluation

Metric         This Model   Standard Abliteration   Original Model
KL Divergence  0.0055       ~0.0139                 0
Refusals       12/100       ~9/100                  94/100
  • KL Divergence: 0.0055 indicates very low deviation from the base model's output distribution, suggesting the original model's "Textbook" knowledge and reasoning are largely preserved.
  • Trade-off: This method accepts a slightly higher refusal rate (+3/100 versus unconstrained abliteration) in exchange for preserving the model's knowledge and reasoning capabilities.
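The KL Divergence row can be reproduced conceptually with a small sketch: per-token KL between the original and ablated models' next-token distributions, computed from raw logits (how Heretic aggregates it across prompts is not shown here, and the toy logits below are illustrative, not measured values):

```python
import numpy as np

def kl_divergence(p_logits: np.ndarray, q_logits: np.ndarray) -> float:
    """KL(P || Q) between two next-token distributions, given raw logits.
    Softmax with max-subtraction for numerical stability."""
    p = np.exp(p_logits - p_logits.max()); p /= p.sum()
    q = np.exp(q_logits - q_logits.max()); q /= q.sum()
    return float(np.sum(p * (np.log(p) - np.log(q))))

# Identical logits give zero divergence (the "Original Model" column).
logits = np.array([2.0, 1.0, 0.5])
print(kl_divergence(logits, logits))                # 0.0
# A small perturbation gives a small positive KL, in the spirit of the
# 0.0055 reported for this model versus ~0.0139 for standard abliteration.
print(kl_divergence(logits, logits + np.array([0.01, -0.01, 0.0])) > 0.0)
```

A lower KL means the ablated model's token distribution stays closer to the base model's, which is the property the MLP constraint is designed to protect.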