This update introduces the second-generation merging algorithm and demonstrates another mode of YOYO-Fusion: the fixed anchor point mode. The sixth-generation model adopts an instruction model as the fixed anchor point.

Model Highlights:

  • Merge method: yoyo_fusion_v2

  • Precision: bfloat16

  • Context length: 262,144 / 1,010,000

Parameter Settings:

Temperature=0.7, TopP=0.8, TopK=20, MinP=0.

GitHub Repository:

YOYO-Fusion

Configuration:

The following configuration was used to produce this model:

from yoyo_fusion import run_merge

run_merge(
    model_paths=[
        "Qwen/Qwen3-30B-A3B-Instruct-2507",  # anchor model
        "Qwen/Qwen3-30B-A3B-Thinking-2507",
        "Qwen/Qwen3-Coder-30B-A3B-Instruct",
        "Qwen/Qwen3-30B-A3B-Base",
    ],
    output_dir="YOYO-AI/Qwen3-30B-A3B-YOYO-V6",
    anchor_index=1,                # 0: robust center; n≥1: use n-th model as anchor
    config_dir=1,
    use_geometric_median=False,    # use the geometric median as the robust center
    use_matrix_boost=False,        # apply Matrix Boost to linear/attention layers
    sign_reference_mode=0,         # 0: no alignment; n≥1: align signs to n-th model
    use_z_median=True,             # True: init z* with median; False: init with zeros
)

YOYO-Fusion-V2

Our second-generation merging algorithm incorporates the following improvements:

1. Sign Alignment

Let the reference model index be r = sign_reference_mode - 1, with flattened weight vector x_r ∈ R^D. For each model k ∈ {0, ..., K-1}, the aligned value \tilde{x}_{k,i} is defined element-wise as:

$$\tilde{x}_{k,i} = \begin{cases} -\,x_{k,i}, & \text{if } \operatorname{sign}(x_{r,i}) \neq 0 \text{ and } \operatorname{sign}(x_{k,i}) \cdot \operatorname{sign}(x_{r,i}) < 0 \\ x_{k,i}, & \text{otherwise} \end{cases} \quad \forall i = 1,\dots,D$$

This ensures consistent sign orientation across models before fusion.

This step has no effect in the fixed anchor mode, but delivers substantial gains in the non-fixed anchor mode with a robust center. We highly recommend enabling it when anchor_index=0.
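As a rough sketch, the element-wise alignment rule above can be written in a few lines of NumPy. `align_signs` is an illustrative helper, not the actual yoyo_fusion implementation:

```python
import numpy as np

def align_signs(weights: np.ndarray, ref_index: int) -> np.ndarray:
    """weights: (K, D) flattened weight vectors; ref_index: reference model r."""
    ref = weights[ref_index]
    # Flip only where the reference entry is nonzero and the signs disagree.
    flip = (np.sign(ref) != 0) & (np.sign(weights) * np.sign(ref) < 0)
    return np.where(flip, -weights, weights)

x = np.array([[ 1.0, -2.0,  0.5],
              [-1.0,  2.0,  0.0],
              [ 3.0, -1.0, -0.5]])
aligned = align_signs(x, ref_index=0)
# rows 1 and 2 are flipped where they disagree with row 0;
# zeros in the reference (none here) would leave entries untouched
```

Note that entries equal to zero are never flipped, and columns where the reference itself is zero are left as-is, matching the sign(x_{r,i}) ≠ 0 condition.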


2. Subspace Rank Selection

Let R ∈ R^{K × D} be the residual matrix after centering, and let its SVD be R^T = U Σ V^T, with singular values σ_1 ≥ ... ≥ σ_r > 0, where r ≤ min(K, D). Define eigenvalues λ_j = σ_j^2.

Before (Original):

$$t_{\text{old}} = \begin{cases} K - 1, & \text{if } \texttt{use\_k\_minus\_one\_truncation} = \texttt{True} \\ K, & \text{otherwise} \end{cases} \quad \text{(clamped to } \leq r\text{)}$$

After (Improved):

$$\mathrm{PR} = \frac{\left( \sum_{j=1}^{r} \lambda_j \right)^2}{\sum_{j=1}^{r} \lambda_j^2}, \quad t_{\text{new}} = \max\left(1,\ \min\left( \left\lfloor \mathrm{PR} \right\rceil,\ K,\ r \right) \right)$$

The new rank t_new is data-adaptive, reflecting the effective dimensionality of the residual subspace.

Subspace rank selection no longer exposes a toggle, nor does it perform fixed-rank truncation. It is now enabled by default and truncates adaptively based on the singular-value spectrum of the residuals.
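The participation-ratio rank above can be sketched in NumPy as follows. `adaptive_rank` is an illustrative helper under the stated definitions, not the library's actual code; numerically-zero singular values are dropped before computing PR:

```python
import numpy as np

def adaptive_rank(R: np.ndarray) -> int:
    """R: (K, D) centered residual matrix. Returns t_new = clamp(round(PR), 1, min(K, r))."""
    K = R.shape[0]
    s = np.linalg.svd(R, compute_uv=False)  # sorted descending
    # keep only singular values that are nonzero up to numerical noise
    lam = s[s > (s[0] * 1e-12 if s.size else 0.0)] ** 2
    if lam.size == 0:
        return 1
    r = lam.size
    pr = lam.sum() ** 2 / (lam ** 2).sum()  # participation ratio of the spectrum
    return int(max(1, min(round(pr), K, r)))

# A flat spectrum keeps all directions; a rank-1 residual collapses to t = 1.
t_flat = adaptive_rank(np.eye(3, 5))                                  # PR = 3
t_low  = adaptive_rank(np.outer(np.ones(4), np.arange(1.0, 6.0)))     # PR ≈ 1
```

PR interpolates between 1 (one dominant direction) and r (a perfectly flat spectrum), which is what makes the truncation data-adaptive.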


3. Energy Compensation Scaling Factor

Let E_total = sum_{j=1}^r σ_j^2 and E_retained = sum_{j=1}^t σ_j^2, where t is the chosen rank.

Before (Original):

$$\alpha_{\text{old}} = \min\left( \frac{1}{p + \epsilon},\ 10.0 \right), \quad \text{where } p = \frac{E_{\text{retained}}}{E_{\text{total}}}$$

After (Improved):

$$\alpha_{\text{new}} = \min\left( \sqrt{ \frac{E_{\text{total}}}{E_{\text{retained}} + \epsilon} },\ 10.0 \right)$$

The scaling factor is now the square root of the energy ratio rather than its reciprocal, so it operates in the same amplitude domain as the weights and behaves more stably.
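A minimal sketch of the new factor, assuming `sigma` holds the singular values in descending order and `t` is the rank chosen in the previous step (`energy_scale` is an illustrative name):

```python
import numpy as np

def energy_scale(sigma: np.ndarray, t: int, eps: float = 1e-8) -> float:
    """alpha_new = min(sqrt(E_total / (E_retained + eps)), 10.0)."""
    E_total = float(np.sum(sigma ** 2))       # energy of the full spectrum
    E_retained = float(np.sum(sigma[:t] ** 2))  # energy kept after truncation
    return min(float(np.sqrt(E_total / (E_retained + eps))), 10.0)

# Keeping only the top singular value of [2, 1, 1] retains 4/6 of the energy,
# so the compensation factor is sqrt(6/4) ≈ 1.225.
alpha = energy_scale(np.array([2.0, 1.0, 1.0]), t=1)
```

The cap at 10.0 prevents a nearly-empty retained subspace from blowing up the reconstruction.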


4. Initialization of z_star in Subspace

Let Z = R U_t ∈ R^{K × t} be the projection of residuals onto the top-t principal directions.

Before (Original):

The initial estimate for the robust location in the subspace was effectively zero:

$$\mathbf{z}^{(0)}_{\text{old}} = \mathbf{0}$$

After (Improved):

$$\mathbf{z}^{(0)}_{\text{new}} = \begin{cases} \operatorname{median}(\mathbf{Z}, \text{dim}=0), & \text{if } \texttt{use\_z\_median} = \texttt{True} \\ \mathbf{0}, & \text{otherwise} \end{cases}$$

The deviation used in Tukey weighting is then Δ = Z - z^{(0)}_new, enabling more robust outlier down-weighting.

It is recommended to enable use_z_median when using the fixed anchor point mode or when merging a large number of models, as it can improve the accuracy of the estimation.
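The initialization and the deviation fed into Tukey weighting can be sketched as follows (`init_z_star` is an illustrative helper, not the library's actual code):

```python
import numpy as np

def init_z_star(Z: np.ndarray, use_z_median: bool = True) -> np.ndarray:
    """Z: (K, t) residual projections. Initial robust-location estimate z^(0)."""
    if use_z_median:
        return np.median(Z, axis=0)  # coordinate-wise median over the K models
    return np.zeros(Z.shape[1])

Z = np.array([[ 0.0, 1.0],
              [ 1.0, 2.0],
              [10.0, 3.0]])   # first column contains an outlier (10.0)
z0 = init_z_star(Z)           # median start point instead of zeros
delta = Z - z0                # deviations used for Tukey down-weighting
```

Starting from the coordinate-wise median rather than zero means the outlier row already has a large deviation at the first iteration, so Tukey weighting can down-weight it immediately.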


5. Matrix Boost for Linear/Attention Layers

Let the reconstructed offset in original shape be R* ∈ R^{m × n} (e.g., weight matrix of a linear layer). Compute its SVD:

$$\mathbf{R}^* = \mathbf{U}_R \operatorname{diag}(\boldsymbol{\sigma}_R) \mathbf{V}_R^\top, \quad \boldsymbol{\sigma}_R = (\sigma_1^R, \dots, \sigma_{\min(m,n)}^R)$$

If σ_R is non-empty, define the boosted singular values as:

$$\tilde{\boldsymbol{\sigma}}_R = (\sigma_{\max}, \sigma_{\max}, \dots, \sigma_{\max}), \quad \text{where } \sigma_{\max} = \sigma_1^R$$

Then the boosted offset is:

$$\widetilde{\mathbf{R}}^* = \mathbf{U}_R \operatorname{diag}(\tilde{\boldsymbol{\sigma}}_R) \mathbf{V}_R^\top$$

This enforces isotropic amplification along all principal directions, preventing rank collapse in attention and MLP layers.

This is an experimental step derived from this paper. We have found that it is applicable not only to ordinary task vectors but also generalizes to arbitrary residuals.
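The boost above amounts to replacing every singular value with the largest one while keeping the singular vectors. A minimal NumPy sketch, assuming `R` is the reconstructed 2-D offset (`matrix_boost` is an illustrative name):

```python
import numpy as np

def matrix_boost(R: np.ndarray) -> np.ndarray:
    """Set all singular values of R to sigma_max, preserving U_R and V_R."""
    U, s, Vt = np.linalg.svd(R, full_matrices=False)  # s sorted descending
    if s.size == 0:
        return R
    s_boost = np.full_like(s, s[0])  # sigma_max broadcast across the spectrum
    return U @ np.diag(s_boost) @ Vt

# A matrix with spectrum [3, 1] becomes isotropic with spectrum [3, 3].
boosted = matrix_boost(np.diag([3.0, 1.0]))
```

Since every direction now carries σ_max, the boosted offset has a perfectly flat spectrum: all principal directions are amplified equally, which is the isotropic behavior described above.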
