This update introduces our second-generation merging algorithm and demonstrates another mode of YOYO-Fusion: the fixed anchor point mode. The sixth-generation model adopts an instruction model as its fixed anchor point.
Model Highlights:
- merge method: yoyo_fusion_v2
- precision: dtype: bfloat16
- context length: 262,144 & 1,010,000

Parameter Settings:
Temperature=0.7, TopP=0.8, TopK=20, MinP=0.

GitHub Repository:
Configuration:
The following configuration was used to produce this model:
```python
from yoyo_fusion import run_merge

run_merge(
    model_paths=[
        "Qwen/Qwen3-30B-A3B-Instruct-2507",  # anchor model
        "Qwen/Qwen3-30B-A3B-Thinking-2507",
        "Qwen/Qwen3-Coder-30B-A3B-Instruct",
        "Qwen/Qwen3-30B-A3B-Base",
    ],
    output_dir="YOYO-AI/Qwen3-30B-A3B-YOYO-V6",
    anchor_index=1,              # 0: robust center; n≥1: use n-th model as anchor
    config_dir=1,
    use_geometric_median=False,
    use_matrix_boost=False,      # apply Matrix Boost for linear/attention layers
    sign_reference_mode=0,       # 0: no alignment; n≥1: align signs to n-th model
    use_z_median=True,           # True: init z* with median; False: init with zeros
)
```
YOYO-Fusion-V2
Our second-generation merging algorithm incorporates the following improvements:
1. Sign Alignment
Let the reference model index be r = sign_reference_mode - 1, with flattened weight vector x_r ∈ R^D. For each model k ∈ {0, ..., K-1}, the aligned tensor x̃_k is defined element-wise as:

x̃_{k,i} = sign(x_{r,i}) · |x_{k,i}|, for i = 1, ..., D
This ensures consistent sign orientation across models before fusion.
This step has no effect in the fixed anchor mode, but delivers substantial gains in the non-fixed anchor mode with a robust center. We highly recommend enabling it when anchor_index=0.
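The element-wise alignment above can be sketched in numpy as follows; the helper name and the handling of exact zeros in the reference are assumptions for illustration, not part of the released implementation:

```python
import numpy as np

def align_signs(x_k: np.ndarray, x_r: np.ndarray) -> np.ndarray:
    """Align model k's flattened weights to the reference model r:
    keep each weight's magnitude, adopt the reference's sign.
    Where the reference entry is exactly zero, keep the original sign
    (an assumed convention)."""
    ref_sign = np.sign(x_r)
    ref_sign = np.where(ref_sign == 0, np.sign(x_k), ref_sign)
    return ref_sign * np.abs(x_k)

x_r = np.array([1.0, -2.0, 0.5])    # reference weights
x_k = np.array([-0.3, -1.0, -0.2])  # weights to align
aligned = align_signs(x_k, x_r)     # magnitudes kept: [0.3, -1.0, 0.2]
```

After this step, every model contributes residuals with a consistent sign orientation, which is what makes the robust-center estimate meaningful when anchor_index=0.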
2. Subspace Rank Selection
Let R ∈ R^{K × D} be the residual matrix after centering, and let its SVD be R^T = U Σ V^T, with singular values σ_1 ≥ ... ≥ σ_r > 0, where r = min(K, D). Define eigenvalues λ_j = σ_j^2.
Before (Original):
After (Improved):
The new rank t_new is data-adaptive, reflecting the effective dimensionality of the residual subspace.
Subspace rank selection no longer exposes a toggle option, and no longer performs fixed-rank truncation. It is now enabled by default and truncates adaptively based on the distribution of the eigenvalues λ_j.
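One common way to make the rank data-adaptive is to keep the smallest t whose eigenvalues retain a given fraction of the total spectral energy. The sketch below illustrates that idea; the 99% threshold and the function name are assumptions, not the exact V2 criterion:

```python
import numpy as np

def adaptive_rank(singular_values: np.ndarray, energy: float = 0.99) -> int:
    """Smallest rank t such that the top-t eigenvalues (squared singular
    values) retain at least `energy` of the total spectrum energy.
    The 0.99 threshold is illustrative, not the algorithm's actual value."""
    lam = np.sort(singular_values)[::-1] ** 2   # eigenvalues, descending
    cum = np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(cum, energy) + 1)

sigma = np.array([10.0, 3.0, 0.5, 0.1])  # toy residual spectrum
t = adaptive_rank(sigma)                 # t = 2: two directions carry >99%
```

A spectrum dominated by a few directions thus yields a small t, while a flat spectrum keeps nearly all of r = min(K, D) directions.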
3. Energy Compensation Scaling Factor
Let E_total = sum_{j=1}^r σ_j^2 and E_retained = sum_{j=1}^t σ_j^2, where t is the chosen rank.
Before (Original):
After (Improved):
The calculation of the scaling factor has been revised and now behaves more stably.
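A minimal sketch of an energy-compensation factor built from the quantities defined above. The sqrt form and the clamping cap are assumptions chosen to illustrate a stable variant, not the exact V2 formula:

```python
import numpy as np

def energy_scale(sigma: np.ndarray, t: int, cap: float = 2.0) -> float:
    """Compensate the energy lost by truncating to rank t:
    scale = sqrt(E_total / E_retained), clamped to `cap` so that a
    near-empty retained spectrum cannot blow the offset up.
    Both the sqrt form and the cap are illustrative assumptions."""
    e_total = float(np.sum(sigma ** 2))
    e_retained = float(np.sum(sigma[:t] ** 2))
    if e_retained == 0.0:
        return 1.0  # nothing retained: fall back to no scaling
    return min(float(np.sqrt(e_total / e_retained)), cap)

scale = energy_scale(np.array([4.0, 3.0]), t=1)  # sqrt(25/16) = 1.25
```

Clamping is one simple way to keep the factor well-behaved when E_retained is small relative to E_total.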
4. Initialization of z_star in Subspace
Let Z = R U_t ∈ R^{K × t} be the projection of residuals onto the top-t principal directions.
Before (Original):
The initial estimate of the robust location in the subspace was effectively zero:

z^{(0)} = 0 ∈ R^t

After (Improved):

The initial estimate is the coordinate-wise median of the projected residuals:

(z^{(0)}_new)_j = median_k Z_{k,j}, for j = 1, ..., t
The deviation used in Tukey weighting is then Δ = Z - z^{(0)}_new, enabling more robust outlier down-weighting.
It is recommended to enable use_z_median when using the fixed anchor point mode or when merging a large number of models, as it can improve the accuracy of the estimate.
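The two initialization choices map directly onto the use_z_median flag. A small sketch (the helper name is an assumption; the median/zeros behaviour follows the config comment above):

```python
import numpy as np

def init_z_star(Z: np.ndarray, use_z_median: bool = True) -> np.ndarray:
    """Initial robust-location estimate in the rank-t subspace.
    use_z_median=True: coordinate-wise median of the K projected residuals
    (robust to outlier models); False: the old zero initialization."""
    if use_z_median:
        return np.median(Z, axis=0)
    return np.zeros(Z.shape[1])

Z = np.array([[1.0, 2.0],
              [3.0, 0.0],
              [100.0, 4.0]])   # K=3 models, t=2; one outlier row
z0 = init_z_star(Z)            # [3.0, 2.0] — the outlier barely moves it
delta = Z - z0                 # deviations fed into Tukey weighting
```

Starting from the median rather than zero means the Tukey weights immediately down-weight the outlier model instead of being dominated by it.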
5. Matrix Boost for Linear/Attention Layers
Let the reconstructed offset in original shape be R* ∈ R^{m × n} (e.g., the weight matrix of a linear layer). Compute its SVD:

R* = U_R Σ_R V_R^T, with singular values σ_R = (σ_1, ..., σ_p), p = min(m, n)
If σ_R is non-empty, define the boosted singular values as:
Then the boosted offset is:
This enforces isotropic amplification along all principal directions, preventing rank collapse in attention and MLP layers.
This is an experimental step derived from this paper. We have found that it applies not only to ordinary task vectors but also generalizes to arbitrary residuals.
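The reconstruction pattern behind Matrix Boost can be sketched as below. Since the exact V2 boosting formula is not reproduced here, the mapping over the singular values is left as a parameter; equalizing all singular values to their mean is shown only as one possible isotropic choice:

```python
import numpy as np

def matrix_boost(R: np.ndarray, boost) -> np.ndarray:
    """Rebuild an offset matrix with boosted singular values:
    R = U diag(sigma) V^T  ->  U diag(boost(sigma)) V^T.
    `boost` maps the singular-value vector to its boosted version;
    the concrete mapping used by YOYO-Fusion-V2 is not assumed here."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)
    if s.size == 0:
        return R  # nothing to boost
    return (U * boost(s)) @ Vt  # scales U's columns, then recombines

# Illustrative isotropic choice: give every principal direction the
# same magnitude, so no single direction dominates the offset.
R = np.array([[2.0, 0.0],
              [0.0, 0.5]])
R_boosted = matrix_boost(R, lambda s: np.full_like(s, s.mean()))
```

Because every direction ends up with comparable energy, a low-rank offset cannot collapse onto one or two dominant directions, which matches the stated goal for attention and MLP layers.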