This is a GPT-4o-distilled fine-tune of Llama-3.3-70B-Instruct, decensored with P-E-W's Heretic (v1.2.0) abliteration engine with Magnitude-Preserving Orthogonal Ablation enabled.
Edit: This model is weird. Unable to refuse in standard prose, it instead writes Python scripts whose print statements or variable names refuse the request in non-standard ways. I have seen similar fallback safety mechanisms in other models, usually as disclaimers or overt non-compliance, but writing code to refuse is wild. Still, it is largely decensored. I cannot test it as thoroughly as I would like due to limited local hardware capacity. Feedback is welcome.
## Heretication Results
| Score Metric | Value | Parameter | Value |
|---|---|---|---|
| Refusals | 9/104 | direction_index | per layer |
| KL divergence | 0.0347 | attn.o_proj.max_weight | 1.93 |
| Initial refusals | 102/104 | attn.o_proj.max_weight_position | 17.32 |
| | | attn.o_proj.min_weight | 1.92 |
| | | attn.o_proj.min_weight_distance | 43.65 |
| | | mlp.down_proj.max_weight | 0.64 |
| | | mlp.down_proj.max_weight_position | 19.49 |
| | | mlp.down_proj.min_weight | 0.07 |
| | | mlp.down_proj.min_weight_distance | 78.30 |
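For context, abliteration removes the component of each weight-matrix row that lies along the refusal direction, and the magnitude-preserving variant rescales each row back to its original norm afterwards. The sketch below is a hypothetical, pure-Python illustration of that idea for a single row (`ablate_row` and `weight` are illustrative names, not Heretic's actual API):

```python
import math

def ablate_row(row, r_unit, weight):
    """Subtract `weight` times the component of `row` along the unit refusal
    direction `r_unit`, then rescale the result to preserve the row's norm."""
    dot = sum(x * y for x, y in zip(row, r_unit))
    ablated = [x - weight * dot * y for x, y in zip(row, r_unit)]
    old_norm = math.sqrt(sum(x * x for x in row))
    new_norm = math.sqrt(sum(x * x for x in ablated))
    scale = old_norm / new_norm if new_norm else 1.0
    return [x * scale for x in ablated]

# Toy example: unit refusal direction along the first axis.
r_unit = [1.0, 0.0]
row = [3.0, 4.0]
ablated = ablate_row(row, r_unit, weight=1.0)
# The component along r_unit is removed; the norm is restored to |row| = 5.
```

The `max_weight` / `min_weight` parameters in the table control how strongly this projection is applied per layer, peaking near the listed positions.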
## Appendix
All trials were evaluated with an empty system prompt; `»` marks the selected trial.
```
» [Trial 75] Refusals: 9/104, KL divergence: 0.0347
  [Trial 104] Refusals: 10/104, KL divergence: 0.0315
  [Trial 150] Refusals: 13/104, KL divergence: 0.0276
  [Trial 167] Refusals: 15/104, KL divergence: 0.0251
  [Trial 109] Refusals: 20/104, KL divergence: 0.0232
  [Trial 112] Refusals: 21/104, KL divergence: 0.0209
  [Trial 97] Refusals: 25/104, KL divergence: 0.0196
  [Trial 168] Refusals: 28/104, KL divergence: 0.0186
  [Trial 87] Refusals: 30/104, KL divergence: 0.0178
  [Trial 37] Refusals: 33/104, KL divergence: 0.0166
  [Trial 153] Refusals: 36/104, KL divergence: 0.0161
  [Trial 99] Refusals: 37/104, KL divergence: 0.0157
  [Trial 189] Refusals: 38/104, KL divergence: 0.0157
  [Trial 174] Refusals: 40/104, KL divergence: 0.0146
  [Trial 196] Refusals: 41/104, KL divergence: 0.0145
  [Trial 79] Refusals: 42/104, KL divergence: 0.0140
  [Trial 128] Refusals: 43/104, KL divergence: 0.0137
  [Trial 195] Refusals: 45/104, KL divergence: 0.0129
  [Trial 172] Refusals: 49/104, KL divergence: 0.0119
  [Trial 136] Refusals: 53/104, KL divergence: 0.0105
  [Trial 25] Refusals: 54/104, KL divergence: 0.0097
  [Trial 181] Refusals: 57/104, KL divergence: 0.0095
  [Trial 54] Refusals: 62/104, KL divergence: 0.0082
  [Trial 187] Refusals: 66/104, KL divergence: 0.0072
  [Trial 35] Refusals: 70/104, KL divergence: 0.0055
  [Trial 53] Refusals: 79/104, KL divergence: 0.0055
  [Trial 13] Refusals: 81/104, KL divergence: 0.0049
  [Trial 90] Refusals: 85/104, KL divergence: 0.0042
  [Trial 21] Refusals: 88/104, KL divergence: 0.0037
  [Trial 62] Refusals: 91/104, KL divergence: 0.0036
  [Trial 148] Refusals: 92/104, KL divergence: 0.0031
  [Trial 1] Refusals: 93/104, KL divergence: 0.0025
  [Trial 119] Refusals: 94/104, KL divergence: 0.0025
  [Trial 165] Refusals: 95/104, KL divergence: 0.0023
  [Trial 156] Refusals: 96/104, KL divergence: 0.0023
  [Trial 82] Refusals: 97/104, KL divergence: 0.0018
  [Trial 48] Refusals: 98/104, KL divergence: 0.0014
  [Trial 55] Refusals: 99/104, KL divergence: 0.0013
  [Trial 22] Refusals: 100/104, KL divergence: 0.0013
  [Trial 9] Refusals: 101/104, KL divergence: 0.0012
  [Trial 117] Refusals: 102/104, KL divergence: 0.0011
```
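These trials trace the refusal/KL-divergence trade-off frontier from which Trial 75 was chosen. A hypothetical helper for picking a trial under a KL budget (the `best_trial` function and dict keys are illustrative, not Heretic's API) could look like:

```python
def best_trial(trials, max_kl):
    """Among trials whose KL divergence stays within `max_kl`, return the one
    with the fewest refusals (ties broken by lower KL divergence)."""
    eligible = [t for t in trials if t["kl"] <= max_kl]
    return min(eligible, key=lambda t: (t["refusals"], t["kl"]))

# First few trials from the list above.
trials = [
    {"trial": 75, "refusals": 9, "kl": 0.0347},
    {"trial": 104, "refusals": 10, "kl": 0.0315},
    {"trial": 150, "refusals": 13, "kl": 0.0276},
]
print(best_trial(trials, max_kl=0.04)["trial"])  # → 75
```

Lowering the KL budget (e.g. `max_kl=0.03`) trades a few more refusals for output distributions closer to the base model.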
## Residual Geometry
```
┏━━━━━━━┳━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃ Layer ┃ S(g,b) ┃ S(g*,b*) ┃ S(g,r)  ┃ S(g*,r*) ┃ S(b,r)  ┃ S(b*,r*) ┃ |g|    ┃ |g*|   ┃ |b|    ┃ |b*|   ┃ |r|    ┃ |r*|   ┃ Silh   ┃
┡━━━━━━━╇━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│ 1     │ 0.9956 │ 0.9952   │ -0.0868 │ -0.0962  │ 0.0070  │ 0.0018   │ 0.45   │ 0.45   │ 0.45   │ 0.45   │ 0.04   │ 0.04   │ 0.1804 │
│ 2     │ 0.9945 │ 0.9941   │ -0.0484 │ -0.0548  │ 0.0561  │ 0.0540   │ 0.52   │ 0.52   │ 0.52   │ 0.52   │ 0.05   │ 0.06   │ 0.1761 │
│ 3     │ 0.9888 │ 0.9878   │ -0.1691 │ -0.1849  │ -0.0198 │ -0.0299  │ 0.61   │ 0.61   │ 0.60   │ 0.60   │ 0.09   │ 0.09   │ 0.1972 │
│ 4     │ 0.9894 │ 0.9889   │ -0.2148 │ -0.2280  │ -0.0708 │ -0.0808  │ 0.85   │ 0.86   │ 0.84   │ 0.84   │ 0.12   │ 0.13   │ 0.1600 │
│ 5     │ 0.9836 │ 0.9828   │ -0.2254 │ -0.2383  │ -0.0458 │ -0.0547  │ 0.95   │ 0.96   │ 0.93   │ 0.93   │ 0.17   │ 0.18   │ 0.1708 │
│ 6     │ 0.9768 │ 0.9757   │ -0.2380 │ -0.2522  │ -0.0246 │ -0.0342  │ 1.09   │ 1.09   │ 1.06   │ 1.06   │ 0.23   │ 0.24   │ 0.1796 │
│ 7     │ 0.9750 │ 0.9741   │ -0.2535 │ -0.2687  │ -0.0324 │ -0.0438  │ 1.24   │ 1.25   │ 1.20   │ 1.20   │ 0.28   │ 0.28   │ 0.1774 │
│ 8     │ 0.9674 │ 0.9668   │ -0.2454 │ -0.2598  │ 0.0082  │ -0.0042  │ 1.54   │ 1.55   │ 1.50   │ 1.50   │ 0.39   │ 0.40   │ 0.1783 │
│ 9     │ 0.9649 │ 0.9646   │ -0.2716 │ -0.2810  │ -0.0091 │ -0.0181  │ 1.83   │ 1.84   │ 1.76   │ 1.76   │ 0.48   │ 0.48   │ 0.1712 │
│ 10    │ 0.9397 │ 0.9395   │ -0.2178 │ -0.2246  │ 0.1292  │ 0.1228   │ 1.90   │ 1.91   │ 1.87   │ 1.87   │ 0.66   │ 0.66   │ 0.1671 │
│ 11    │ 0.9346 │ 0.9345   │ -0.1906 │ -0.1956  │ 0.1710  │ 0.1662   │ 2.08   │ 2.08   │ 2.07   │ 2.07   │ 0.75   │ 0.75   │ 0.1538 │
│ 12    │ 0.9291 │ 0.9290   │ -0.2163 │ -0.2151  │ 0.1600  │ 0.1616   │ 2.54   │ 2.54   │ 2.51   │ 2.51   │ 0.95   │ 0.95   │ 0.1456 │
│ 13    │ 0.9315 │ 0.9315   │ -0.1908 │ -0.1938  │ 0.1792  │ 0.1764   │ 2.84   │ 2.85   │ 2.83   │ 2.84   │ 1.05   │ 1.05   │ 0.1514 │
│ 14    │ 0.9039 │ 0.9038   │ -0.1919 │ -0.1945  │ 0.2465  │ 0.2441   │ 3.06   │ 3.07   │ 3.10   │ 3.10   │ 1.35   │ 1.35   │ 0.1714 │
│ 15    │ 0.8760 │ 0.8762   │ -0.1617 │ -0.1648  │ 0.3344  │ 0.3310   │ 3.38   │ 3.40   │ 3.54   │ 3.55   │ 1.73   │ 1.73   │ 0.1990 │
│ 16    │ 0.8565 │ 0.8572   │ -0.2078 │ -0.2087  │ 0.3268  │ 0.3248   │ 3.75   │ 3.76   │ 3.88   │ 3.88   │ 2.05   │ 2.05   │ 0.2183 │
│ 17    │ 0.8419 │ 0.8415   │ -0.2487 │ -0.2492  │ 0.3133  │ 0.3135   │ 3.96   │ 3.97   │ 4.04   │ 4.05   │ 2.25   │ 2.26   │ 0.2316 │
│ 18    │ 0.8634 │ 0.8628   │ -0.3343 │ -0.3404  │ 0.1868  │ 0.1817   │ 4.97   │ 4.99   │ 4.76   │ 4.78   │ 2.55   │ 2.57   │ 0.2290 │
│ 19    │ 0.8244 │ 0.8230   │ -0.2838 │ -0.2885  │ 0.3088  │ 0.3065   │ 5.12   │ 5.14   │ 5.17   │ 5.18   │ 3.05   │ 3.07   │ 0.2362 │
│ 20    │ 0.7963 │ 0.7931   │ -0.2662 │ -0.2732  │ 0.3712  │ 0.3692   │ 4.99   │ 5.02   │ 5.18   │ 5.20   │ 3.25   │ 3.29   │ 0.2442 │
│ 21    │ 0.8157 │ 0.8132   │ -0.2597 │ -0.2673  │ 0.3468  │ 0.3435   │ 5.58   │ 5.61   │ 5.75   │ 5.76   │ 3.44   │ 3.48   │ 0.2579 │
│ 22    │ 0.7708 │ 0.7684   │ -0.2226 │ -0.2269  │ 0.4495  │ 0.4490   │ 5.25   │ 5.28   │ 5.73   │ 5.75   │ 3.75   │ 3.78   │ 0.2665 │
│ 23    │ 0.7653 │ 0.7625   │ -0.2071 │ -0.2101  │ 0.4712  │ 0.4723   │ 5.04   │ 5.06   │ 5.59   │ 5.61   │ 3.68   │ 3.71   │ 0.2653 │
│ 24    │ 0.7129 │ 0.7084   │ -0.2047 │ -0.2051  │ 0.5405  │ 0.5455   │ 5.21   │ 5.22   │ 6.06   │ 6.10   │ 4.34   │ 4.40   │ 0.2743 │
│ 25    │ 0.6840 │ 0.6789   │ -0.2283 │ -0.2310  │ 0.5541  │ 0.5575   │ 5.14   │ 5.17   │ 6.01   │ 6.05   │ 4.51   │ 4.57   │ 0.2760 │
│ 26    │ 0.6983 │ 0.6936   │ -0.2291 │ -0.2335  │ 0.5368  │ 0.5385   │ 5.33   │ 5.38   │ 6.15   │ 6.21   │ 4.53   │ 4.60   │ 0.2658 │
│ 27    │ 0.6600 │ 0.6554   │ -0.2151 │ -0.2203  │ 0.5917  │ 0.5923   │ 5.70   │ 5.76   │ 6.91   │ 6.98   │ 5.31   │ 5.40   │ 0.2832 │
│ 28    │ 0.6256 │ 0.6202   │ -0.2287 │ -0.2341  │ 0.6164  │ 0.6174   │ 5.73   │ 5.79   │ 7.09   │ 7.16   │ 5.68   │ 5.78   │ 0.2949 │
│ 29    │ 0.6010 │ 0.5933   │ -0.2367 │ -0.2435  │ 0.6343  │ 0.6363   │ 5.67   │ 5.73   │ 7.12   │ 7.21   │ 5.86   │ 5.98   │ 0.3012 │
│ 30    │ 0.6007 │ 0.5929   │ -0.2400 │ -0.2461  │ 0.6319  │ 0.6346   │ 5.77   │ 5.83   │ 7.22   │ 7.32   │ 5.95   │ 6.08   │ 0.3010 │
│ 31    │ 0.5913 │ 0.5834   │ -0.2388 │ -0.2468  │ 0.6419  │ 0.6431   │ 5.79   │ 5.87   │ 7.34   │ 7.43   │ 6.09   │ 6.23   │ 0.3093 │
│ 32    │ 0.5965 │ 0.5880   │ -0.2360 │ -0.2451  │ 0.6391  │ 0.6400   │ 6.07   │ 6.16   │ 7.67   │ 7.77   │ 6.33   │ 6.48   │ 0.3104 │
│ 33    │ 0.5763 │ 0.5661   │ -0.2582 │ -0.2700  │ 0.6407  │ 0.6409   │ 6.42   │ 6.52   │ 8.08   │ 8.18   │ 6.83   │ 7.01   │ 0.3210 │
│ 34    │ 0.5836 │ 0.5731   │ -0.2253 │ -0.2382  │ 0.6597  │ 0.6594   │ 6.69   │ 6.80   │ 8.67   │ 8.79   │ 7.23   │ 7.41   │ 0.3249 │
│ 35    │ 0.5922 │ 0.5812   │ -0.2490 │ -0.2622  │ 0.6330  │ 0.6329   │ 6.70   │ 6.80   │ 8.38   │ 8.48   │ 6.98   │ 7.15   │ 0.3216 │
│ 36    │ 0.5817 │ 0.5718   │ -0.2454 │ -0.2573  │ 0.6458  │ 0.6456   │ 6.69   │ 6.78   │ 8.49   │ 8.58   │ 7.13   │ 7.29   │ 0.3126 │
│ 37    │ 0.5846 │ 0.5738   │ -0.2334 │ -0.2473  │ 0.6524  │ 0.6517   │ 6.80   │ 6.89   │ 8.72   │ 8.81   │ 7.28   │ 7.44   │ 0.3053 │
│ 38    │ 0.5685 │ 0.5594   │ -0.2415 │ -0.2524  │ 0.6610  │ 0.6609   │ 7.13   │ 7.21   │ 9.22   │ 9.30   │ 7.82   │ 7.96   │ 0.2977 │
│ 39    │ 0.5636 │ 0.5545   │ -0.2466 │ -0.2568  │ 0.6615  │ 0.6619   │ 7.25   │ 7.33   │ 9.37   │ 9.45   │ 7.98   │ 8.14   │ 0.2960 │
│ 40    │ 0.5695 │ 0.5601   │ -0.2585 │ -0.2692  │ 0.6469  │ 0.6470   │ 7.63   │ 7.72   │ 9.66   │ 9.75   │ 8.22   │ 8.38   │ 0.2882 │
│ 41    │ 0.5672 │ 0.5577   │ -0.2659 │ -0.2764  │ 0.6432  │ 0.6436   │ 7.78   │ 7.88   │ 9.80   │ 9.89   │ 8.37   │ 8.54   │ 0.2831 │
│ 42    │ 0.5502 │ 0.5409   │ -0.2422 │ -0.2517  │ 0.6769  │ 0.6779   │ 7.95   │ 8.05   │ 10.48  │ 10.59  │ 9.02   │ 9.21   │ 0.2859 │
│ 43    │ 0.5482 │ 0.5385   │ -0.2500 │ -0.2597  │ 0.6727  │ 0.6739   │ 8.17   │ 8.27   │ 10.69  │ 10.81  │ 9.24   │ 9.43   │ 0.2808 │
│ 44    │ 0.5465 │ 0.5369   │ -0.2424 │ -0.2521  │ 0.6800  │ 0.6810   │ 8.34   │ 8.44   │ 11.04  │ 11.16  │ 9.53   │ 9.73   │ 0.2757 │
│ 45    │ 0.5464 │ 0.5369   │ -0.2533 │ -0.2624  │ 0.6718  │ 0.6732   │ 8.77   │ 8.87   │ 11.46  │ 11.58  │ 9.92   │ 10.12  │ 0.2726 │
│ 46    │ 0.5373 │ 0.5280   │ -0.2464 │ -0.2544  │ 0.6850  │ 0.6870   │ 8.89   │ 8.99   │ 11.83  │ 11.96  │ 10.29  │ 10.50  │ 0.2713 │
│ 47    │ 0.5270 │ 0.5174   │ -0.2336 │ -0.2420  │ 0.7032  │ 0.7051   │ 8.95   │ 9.05   │ 12.24  │ 12.38  │ 10.70  │ 10.92  │ 0.2702 │
│ 48    │ 0.5299 │ 0.5201   │ -0.2219 │ -0.2300  │ 0.7093  │ 0.7116   │ 9.25   │ 9.34   │ 12.79  │ 12.94  │ 11.12  │ 11.36  │ 0.2670 │
│ 49    │ 0.5258 │ 0.5161   │ -0.2184 │ -0.2260  │ 0.7153  │ 0.7178   │ 9.54   │ 9.64   │ 13.33  │ 13.49  │ 11.62  │ 11.86  │ 0.2648 │
│ 50    │ 0.5274 │ 0.5176   │ -0.2106 │ -0.2179  │ 0.7195  │ 0.7223   │ 9.91   │ 10.00  │ 13.94  │ 14.12  │ 12.12  │ 12.38  │ 0.2643 │
│ 51    │ 0.5325 │ 0.5227   │ -0.2051 │ -0.2124  │ 0.7192  │ 0.7221   │ 10.17  │ 10.26  │ 14.32  │ 14.50  │ 12.38  │ 12.65  │ 0.2585 │
│ 52    │ 0.5319 │ 0.5218   │ -0.2064 │ -0.2133  │ 0.7188  │ 0.7221   │ 10.53  │ 10.63  │ 14.82  │ 15.01  │ 12.83  │ 13.11  │ 0.2566 │
│ 53    │ 0.5346 │ 0.5245   │ -0.2121 │ -0.2190  │ 0.7125  │ 0.7159   │ 11.03  │ 11.13  │ 15.37  │ 15.55  │ 13.29  │ 13.57  │ 0.2527 │
│ 54    │ 0.5351 │ 0.5248   │ -0.2101 │ -0.2172  │ 0.7135  │ 0.7169   │ 11.34  │ 11.45  │ 15.83  │ 16.03  │ 13.68  │ 13.98  │ 0.2499 │
│ 55    │ 0.5342 │ 0.5242   │ -0.2051 │ -0.2122  │ 0.7178  │ 0.7210   │ 11.56  │ 11.67  │ 16.26  │ 16.46  │ 14.04  │ 14.34  │ 0.2455 │
│ 56    │ 0.5426 │ 0.5326   │ -0.1891 │ -0.1963  │ 0.7223  │ 0.7254   │ 11.92  │ 12.03  │ 16.92  │ 17.13  │ 14.48  │ 14.79  │ 0.2429 │
│ 57    │ 0.5448 │ 0.5351   │ -0.1848 │ -0.1924  │ 0.7235  │ 0.7261   │ 12.37  │ 12.48  │ 17.61  │ 17.81  │ 15.03  │ 15.33  │ 0.2367 │
│ 58    │ 0.5455 │ 0.5360   │ -0.1822 │ -0.1895  │ 0.7247  │ 0.7274   │ 12.63  │ 12.75  │ 18.03  │ 18.24  │ 15.37  │ 15.68  │ 0.2317 │
│ 59    │ 0.5449 │ 0.5352   │ -0.1750 │ -0.1825  │ 0.7302  │ 0.7329   │ 12.86  │ 12.97  │ 18.53  │ 18.75  │ 15.78  │ 16.10  │ 0.2299 │
│ 60    │ 0.5415 │ 0.5318   │ -0.1675 │ -0.1748  │ 0.7381  │ 0.7408   │ 13.08  │ 13.20  │ 19.12  │ 19.35  │ 16.30  │ 16.64  │ 0.2296 │
│ 61    │ 0.5460 │ 0.5363   │ -0.1628 │ -0.1703  │ 0.7377  │ 0.7404   │ 13.53  │ 13.65  │ 19.77  │ 20.01  │ 16.79  │ 17.14  │ 0.2273 │
│ 62    │ 0.5450 │ 0.5357   │ -0.1571 │ -0.1642  │ 0.7424  │ 0.7450   │ 13.85  │ 13.97  │ 20.41  │ 20.65  │ 17.33  │ 17.68  │ 0.2251 │
│ 63    │ 0.5465 │ 0.5373   │ -0.1477 │ -0.1550  │ 0.7475  │ 0.7499   │ 14.06  │ 14.19  │ 20.94  │ 21.19  │ 17.73  │ 18.09  │ 0.2240 │
│ 64    │ 0.5478 │ 0.5384   │ -0.1454 │ -0.1526  │ 0.7481  │ 0.7506   │ 14.46  │ 14.58  │ 21.55  │ 21.81  │ 18.23  │ 18.60  │ 0.2233 │
│ 65    │ 0.5496 │ 0.5403   │ -0.1418 │ -0.1485  │ 0.7491  │ 0.7519   │ 14.74  │ 14.86  │ 22.03  │ 22.29  │ 18.59  │ 18.96  │ 0.2189 │
│ 66    │ 0.5515 │ 0.5425   │ -0.1358 │ -0.1423  │ 0.7515  │ 0.7544   │ 14.99  │ 15.10  │ 22.51  │ 22.77  │ 18.95  │ 19.33  │ 0.2166 │
│ 67    │ 0.5554 │ 0.5464   │ -0.1373 │ -0.1435  │ 0.7474  │ 0.7505   │ 15.51  │ 15.63  │ 23.13  │ 23.40  │ 19.42  │ 19.80  │ 0.2161 │
│ 68    │ 0.5617 │ 0.5526   │ -0.1339 │ -0.1404  │ 0.7447  │ 0.7476   │ 16.04  │ 16.16  │ 23.82  │ 24.10  │ 19.89  │ 20.28  │ 0.2147 │
│ 69    │ 0.5639 │ 0.5548   │ -0.1355 │ -0.1417  │ 0.7418  │ 0.7450   │ 16.50  │ 16.62  │ 24.38  │ 24.66  │ 20.32  │ 20.72  │ 0.2121 │
│ 70    │ 0.5743 │ 0.5655   │ -0.1462 │ -0.1513  │ 0.7259  │ 0.7297   │ 17.47  │ 17.58  │ 25.13  │ 25.41  │ 20.79  │ 21.20  │ 0.2111 │
│ 71    │ 0.5846 │ 0.5760   │ -0.1398 │ -0.1449  │ 0.7216  │ 0.7254   │ 18.06  │ 18.18  │ 25.84  │ 26.13  │ 21.17  │ 21.58  │ 0.2093 │
│ 72    │ 0.5948 │ 0.5865   │ -0.1361 │ -0.1411  │ 0.7155  │ 0.7191   │ 18.75  │ 18.87  │ 26.59  │ 26.88  │ 21.58  │ 21.99  │ 0.2060 │
│ 73    │ 0.5964 │ 0.5883   │ -0.1241 │ -0.1294  │ 0.7224  │ 0.7257   │ 19.39  │ 19.51  │ 27.83  │ 28.12  │ 22.51  │ 22.93  │ 0.2062 │
│ 74    │ 0.6186 │ 0.6110   │ -0.1062 │ -0.1108  │ 0.7156  │ 0.7191   │ 20.84  │ 20.95  │ 29.66  │ 29.96  │ 23.44  │ 23.87  │ 0.2041 │
│ 75    │ 0.6324 │ 0.6252   │ -0.0907 │ -0.0952  │ 0.7141  │ 0.7174   │ 22.56  │ 22.68  │ 32.10  │ 32.40  │ 24.97  │ 25.40  │ 0.2064 │
│ 76    │ 0.6282 │ 0.6214   │ -0.0907 │ -0.0947  │ 0.7179  │ 0.7211   │ 23.79  │ 23.90  │ 34.03  │ 34.34  │ 26.59  │ 27.03  │ 0.2008 │
│ 77    │ 0.6257 │ 0.6189   │ -0.0792 │ -0.0832  │ 0.7280  │ 0.7312   │ 25.52  │ 25.62  │ 37.11  │ 37.43  │ 29.04  │ 29.50  │ 0.2046 │
│ 78    │ 0.6483 │ 0.6419   │ -0.0529 │ -0.0569  │ 0.7260  │ 0.7290   │ 27.72  │ 27.82  │ 40.25  │ 40.57  │ 30.69  │ 31.16  │ 0.2007 │
│ 79    │ 0.6718 │ 0.6660   │ -0.0079 │ -0.0100  │ 0.7354  │ 0.7392   │ 32.04  │ 32.07  │ 47.28  │ 47.62  │ 35.02  │ 35.53  │ 0.1996 │
│ 80    │ 0.7264 │ 0.7224   │ -0.0470 │ -0.0586  │ 0.6524  │ 0.6480   │ 112.18 │ 113.71 │ 147.85 │ 149.04 │ 101.72 │ 103.24 │ 0.1946 │
└───────┴────────┴──────────┴─────────┴──────────┴─────────┴──────────┴────────┴────────┴────────┴────────┴────────┴────────┴────────┘
```
- `g` = mean of residual vectors for good prompts
- `g*` = geometric median of residual vectors for good prompts
- `b` = mean of residual vectors for bad prompts
- `b*` = geometric median of residual vectors for bad prompts
- `r` = refusal direction for means (i.e., `b - g`)
- `r*` = refusal direction for geometric medians (i.e., `b* - g*`)
- `S(x,y)` = cosine similarity of `x` and `y`
- `|x|` = L2 norm of `x`
- `Silh` = mean silhouette coefficient of residuals for good/bad clusters
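The table's quantities follow directly from these definitions. A minimal pure-Python sketch (toy two-dimensional residual means, for illustration only; the silhouette coefficient is omitted):

```python
import math

def l2(v):
    """L2 norm, i.e. |x| in the table."""
    return math.sqrt(sum(x * x for x in v))

def cos_sim(x, y):
    """Cosine similarity, i.e. S(x,y) in the table."""
    return sum(a * b for a, b in zip(x, y)) / (l2(x) * l2(y))

def refusal_direction(g, b):
    """r = b - g: the difference of bad and good residual means."""
    return [bi - gi for gi, bi in zip(g, b)]

# Toy residual means for one layer.
g = [1.0, 0.0]
b = [1.0, 1.0]
r = refusal_direction(g, b)      # [0.0, 1.0]
print(round(cos_sim(g, b), 4))   # → 0.7071
print(round(cos_sim(g, r), 4))   # → 0.0
```

In the real table, the consistently negative `S(g,r)` and positive `S(b,r)` values show that the refusal direction points away from the good-prompt cluster and toward the bad-prompt cluster at nearly every layer.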
## Model Card for llama-3.3-70b-4o-final
This model is a fine-tuned version of meta-llama/Llama-3.3-70B-Instruct. It has been trained using TRL.
### Quick start
```python
from transformers import pipeline

question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
# "None" is the template's placeholder; substitute this repository's model id.
generator = pipeline("text-generation", model="None", device="cuda")
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])
```
### Training procedure
This model was trained with SFT.
### Framework versions
- PEFT 0.18.1
- TRL: 0.27.1
- Transformers: 5.0.0
- Pytorch: 2.9.0.dev20250708+cu128
- Datasets: 4.5.0
- Tokenizers: 0.22.2
### Citations
Cite TRL as:
```bibtex
@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}
```