xjsc0 committed
Commit ffbb4ab · 1 Parent(s): 4566e34
LICENSE ADDED
@@ -0,0 +1,25 @@
+ Creative Commons Attribution 4.0 International Public License
+
+ By exercising the Licensed Rights (defined below), You accept and agree
+ to be bound by the terms and conditions of this Creative Commons
+ Attribution 4.0 International Public License ("Public License").
+ To the extent this Public License may be interpreted as a contract,
+ You are granted the Licensed Rights in consideration of Your acceptance
+ of these terms and conditions, and the Licensor grants You such rights
+ in consideration of benefits the Licensor receives from making
+ the Licensed Material available under these terms and conditions.
+
+ You are free to:
+ - Share — copy and redistribute the material in any medium or format
+ - Adapt — remix, transform, and build upon the material for any purpose, even commercially.
+
+ Under the following terms:
+ - Attribution — You must give appropriate credit, provide a link to the license,
+ and indicate if changes were made. You may do so in any reasonable manner,
+ but not in any way that suggests the licensor endorses you or your use.
+
+ No additional restrictions — You may not apply legal terms or
+ technological measures that legally restrict others from doing
+ anything the license permits.
+
+ Full license text: https://creativecommons.org/licenses/by/4.0/legalcode
LICENSE-STABILITY ADDED
@@ -0,0 +1,57 @@
+ STABILITY AI COMMUNITY LICENSE AGREEMENT
+
+ Last Updated: July 5, 2024
+ 1. INTRODUCTION
+
+ This Agreement applies to any individual person or entity (“You”, “Your” or “Licensee”) that uses or distributes any portion or element of the Stability AI Materials or Derivative Works thereof for any Research & Non-Commercial or Commercial purpose. Capitalized terms not otherwise defined herein are defined in Section V below.
+
+ This Agreement is intended to allow research, non-commercial, and limited commercial uses of the Models free of charge. In order to ensure that certain limited commercial uses of the Models continue to be allowed, this Agreement preserves free access to the Models for people or organizations generating annual revenue of less than US $1,000,000 (or local currency equivalent).
+
+ By clicking “I Accept” or by using or distributing any portion or element of the Stability AI Materials or Derivative Works, You agree that You have read, understood and are bound by the terms of this Agreement. If You are acting on behalf of a company, organization or other entity, then “You” includes you and that entity, and You agree that You: (i) are an authorized representative of such entity with the authority to bind such entity to this Agreement, and (ii) You agree to the terms of this Agreement on that entity’s behalf.
+
+ 2. RESEARCH & NON-COMMERCIAL USE LICENSE
+
+ Subject to the terms of this Agreement, Stability AI grants You a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable and royalty-free limited license under Stability AI’s intellectual property or other rights owned by Stability AI embodied in the Stability AI Materials to use, reproduce, distribute, and create Derivative Works of, and make modifications to, the Stability AI Materials for any Research or Non-Commercial Purpose. “Research Purpose” means academic or scientific advancement, and in each case, is not primarily intended for commercial advantage or monetary compensation to You or others. “Non-Commercial Purpose” means any purpose other than a Research Purpose that is not primarily intended for commercial advantage or monetary compensation to You or others, such as personal use (i.e., hobbyist) or evaluation and testing.
+
+ 3. COMMERCIAL USE LICENSE
+
+ Subject to the terms of this Agreement (including the remainder of this Section III), Stability AI grants You a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable and royalty-free limited license under Stability AI’s intellectual property or other rights owned by Stability AI embodied in the Stability AI Materials to use, reproduce, distribute, and create Derivative Works of, and make modifications to, the Stability AI Materials for any Commercial Purpose. “Commercial Purpose” means any purpose other than a Research Purpose or Non-Commercial Purpose that is primarily intended for commercial advantage or monetary compensation to You or others, including but not limited to, (i) creating, modifying, or distributing Your product or service, including via a hosted service or application programming interface, and (ii) for Your business’s or organization’s internal operations.
+ If You are using or distributing the Stability AI Materials for a Commercial Purpose, You must register with Stability AI at (https://stability.ai/community-license). If at any time You or Your Affiliate(s), either individually or in aggregate, generate more than USD $1,000,000 in annual revenue (or the equivalent thereof in Your local currency), regardless of whether that revenue is generated directly or indirectly from the Stability AI Materials or Derivative Works, any licenses granted to You under this Agreement shall terminate as of such date. You must request a license from Stability AI at (https://stability.ai/enterprise), which Stability AI may grant to You in its sole discretion. If you receive Stability AI Materials, or any Derivative Works thereof, from a Licensee as part of an integrated end user product, then Section III of this Agreement will not apply to you.
+
+ 4. GENERAL TERMS
+
+ Your Research, Non-Commercial, and Commercial License(s) under this Agreement are subject to the following terms.
+ a. Distribution & Attribution. If You distribute or make available the Stability AI Materials or a Derivative Work to a third party, or a product or service that uses any portion of them, You shall: (i) provide a copy of this Agreement to that third party, (ii) retain the following attribution notice within a "Notice" text file distributed as a part of such copies: "This Stability AI Model is licensed under the Stability AI Community License, Copyright © Stability AI Ltd. All Rights Reserved”, and (iii) prominently display “Powered by Stability AI” on a related website, user interface, blogpost, about page, or product documentation. If You create a Derivative Work, You may add your own attribution notice(s) to the “Notice” text file included with that Derivative Work, provided that You clearly indicate which attributions apply to the Stability AI Materials and state in the “Notice” text file that You changed the Stability AI Materials and how it was modified.
+ b. Use Restrictions. Your use of the Stability AI Materials and Derivative Works, including any output or results of the Stability AI Materials or Derivative Works, must comply with applicable laws and regulations (including Trade Control Laws and equivalent regulations) and adhere to the Documentation and Stability AI’s AUP, which is hereby incorporated by reference. Furthermore, You will not use the Stability AI Materials or Derivative Works, or any output or results of the Stability AI Materials or Derivative Works, to create or improve any foundational generative AI model (excluding the Models or Derivative Works).
+ c. Intellectual Property.
+ (i) Trademark License. No trademark licenses are granted under this Agreement, and in connection with the Stability AI Materials or Derivative Works, You may not use any name or mark owned by or associated with Stability AI or any of its Affiliates, except as required under Section IV(a) herein.
+ (ii) Ownership of Derivative Works. As between You and Stability AI, You are the owner of Derivative Works You create, subject to Stability AI’s ownership of the Stability AI Materials and any Derivative Works made by or for Stability AI.
+ (iii) Ownership of Outputs. As between You and Stability AI, You own any outputs generated from the Models or Derivative Works to the extent permitted by applicable law.
+ (iv) Disputes. If You or Your Affiliate(s) institute litigation or other proceedings against Stability AI (including a cross-claim or counterclaim in a lawsuit) alleging that the Stability AI Materials, Derivative Works or associated outputs or results, or any portion of any of the foregoing, constitutes infringement of intellectual property or other rights owned or licensable by You, then any licenses granted to You under this Agreement shall terminate as of the date such litigation or claim is filed or instituted. You will indemnify and hold harmless Stability AI from and against any claim by any third party arising out of or related to Your use or distribution of the Stability AI Materials or Derivative Works in violation of this Agreement.
+ (v) Feedback. From time to time, You may provide Stability AI with verbal and/or written suggestions, comments or other feedback related to Stability AI’s existing or prospective technology, products or services (collectively, “Feedback”). You are not obligated to provide Stability AI with Feedback, but to the extent that You do, You hereby grant Stability AI a perpetual, irrevocable, royalty-free, fully-paid, sub-licensable, transferable, non-exclusive, worldwide right and license to exploit the Feedback in any manner without restriction. Your Feedback is provided “AS IS” and You make no warranties whatsoever about any Feedback.
+ d. Disclaimer Of Warranty. UNLESS REQUIRED BY APPLICABLE LAW, THE STABILITY AI MATERIALS AND ANY OUTPUT AND RESULTS THEREFROM ARE PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE FOR DETERMINING THE APPROPRIATENESS OR LAWFULNESS OF USING OR REDISTRIBUTING THE STABILITY AI MATERIALS, DERIVATIVE WORKS OR ANY OUTPUT OR RESULTS AND ASSUME ANY RISKS ASSOCIATED WITH YOUR USE OF THE STABILITY AI MATERIALS, DERIVATIVE WORKS AND ANY OUTPUT AND RESULTS.
+ e. Limitation Of Liability. IN NO EVENT WILL STABILITY AI OR ITS AFFILIATES BE LIABLE UNDER ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, TORT, NEGLIGENCE, PRODUCTS LIABILITY, OR OTHERWISE, ARISING OUT OF THIS AGREEMENT, FOR ANY LOST PROFITS OR ANY DIRECT, INDIRECT, SPECIAL, CONSEQUENTIAL, INCIDENTAL, EXEMPLARY OR PUNITIVE DAMAGES, EVEN IF STABILITY AI OR ITS AFFILIATES HAVE BEEN ADVISED OF THE POSSIBILITY OF ANY OF THE FOREGOING.
+ f. Term And Termination. The term of this Agreement will commence upon Your acceptance of this Agreement or access to the Stability AI Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein. Stability AI may terminate this Agreement if You are in breach of any term or condition of this Agreement. Upon termination of this Agreement, You shall delete and cease use of any Stability AI Materials or Derivative Works. Section IV(d), (e), and (g) shall survive the termination of this Agreement.
+ g. Governing Law. This Agreement will be governed by and construed in accordance with the laws of the United States and the State of California without regard to choice of law principles, and the UN Convention on Contracts for International Sale of Goods does not apply to this Agreement.
+
+ 5. DEFINITIONS
+
+ “Affiliate(s)” means any entity that directly or indirectly controls, is controlled by, or is under common control with the subject entity; for purposes of this definition, “control” means direct or indirect ownership or control of more than 50% of the voting interests of the subject entity.
+
+ "Agreement" means this Stability AI Community License Agreement.
+
+ “AUP” means the Stability AI Acceptable Use Policy available at (https://stability.ai/use-policy), as may be updated from time to time.
+
+ "Derivative Work(s)” means (a) any derivative work of the Stability AI Materials as recognized by U.S. copyright laws and (b) any modifications to a Model, and any other model created which is based on or derived from the Model or the Model’s output, including “fine tune” and “low-rank adaptation” models derived from a Model or a Model’s output, but do not include the output of any Model.
+
+ “Documentation” means any specifications, manuals, documentation, and other written information provided by Stability AI related to the Software or Models.
+
+ “Model(s)" means, collectively, Stability AI’s proprietary models and algorithms, including machine-learning models, trained model weights and other elements of the foregoing listed on Stability’s Core Models Webpage available at (https://stability.ai/core-models), as may be updated from time to time.
+
+ "Stability AI" or "we" means Stability AI Ltd. and its Affiliates.
+
+ "Software" means Stability AI’s proprietary software made available under this Agreement now or in the future.
+
+ “Stability AI Materials” means, collectively, Stability’s proprietary Models, Software and Documentation (and any portion or combination thereof) made available under this Agreement.
+
+ “Trade Control Laws” means any applicable U.S. and non-U.S. export control and trade sanctions laws and regulations.
README.md CHANGED
@@ -15,75 +15,222 @@ short_description: Edit lyrics, keep the melody
  fullWidth: true
  ---

- # YingMusic-Singer
- YingMusic-Singer: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance

- ## Environment Setup

- ### 1. Install from Scratch
  ```bash
  conda create -n YingMusic-Singer python=3.10
  conda activate YingMusic-Singer

- # uv is much quicker
  pip install uv
  uv pip install -r requirements.txt
  ```

- ### 2. Pre-built Conda Environment for One-Click Deployment (Nvidia / AMD CPU Only)

- Coming soon

- ## Inference

- ### Using the HuggingFace Space (online demo)

- Visit https://huggingface.co/spaces/ASLP-lab/YingMusic-Singer for a quick hands-on experience
-
- ### Run with Docker

  docker build -t yingmusic-singer .

- ### Run with Python

- git clone
- cd
- python initialization.py --task infer

- # for Gradio

- python app.py

- # Multi-process inference
- # 1. Make sure every audio fed to the model is a separated, vocals-only track; if it has not been separated yet, you can use /src/third_party/MusicSourceSeparationTraining/inference_api.py to separate it
- # 2. The jsonl file format is one JSON object per line, {}
  python batch_infer.py \
  --input_type jsonl \
  --input_path /path/to/input.jsonl \
  --output_dir /path/to/output \
  --ckpt_path /path/to/ckpts \
  --num_gpus 4

- # Multi-process inference (LyricEditBench melody control)
  python inference_mp.py \
  --input_type lyric_edit_bench_melody_control \
- --output_dir path/to/ \
- LyricEditBench_melody_control \
  --ckpt_path ASLP-lab/YingMusic-Singer \
  --num_gpus 8

- # Multi-process inference (LyricEditBench sing edit)
  python inference_mp.py \
  --input_type lyric_edit_bench_sing_edit \
- --output_dir path/to/ \
- LyricEditBench_melody_control \
  --ckpt_path ASLP-lab/YingMusic-Singer \
  --num_gpus 8

- ## License

- The code and model weights in this project are licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/), except for the following components:

- The VAE model weights and inference code (in `src/YingMusic-Singer/utils/stable-audio-tools`) are derived from [Stable Audio Open](https://huggingface.co/stabilityai/stable-audio-open-1.0) by Stability AI, and are licensed under the [Stability AI Community License](./LICENSE-STABILITY).
  fullWidth: true
  ---

+ <div align="center">
+
+ <h1>🎤 YingMusic-Singer: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance</h1>
+
+ <p>
+ <a href="">English</a> | <a href="README_ZH.md">中文</a>
+ </p>
+
+
+ ![Python](https://img.shields.io/badge/Python-3.10-3776AB?logo=python&logoColor=white)
+ ![License](https://img.shields.io/badge/License-CC--BY--4.0-lightgrey)
+ [![arXiv Paper](https://img.shields.io/badge/arXiv-0.0-b31b1b?logo=arxiv&logoColor=white)](https://arxiv.org/abs/0.0)
+ [![GitHub](https://img.shields.io/badge/GitHub-YingMusic--Singer-181717?logo=github&logoColor=white)](https://github.com/ASLP-lab/YingMusic-Singer)
+ [![HuggingFace Space](https://img.shields.io/badge/🤗%20HuggingFace-Space-FFD21E)](https://huggingface.co/spaces/ASLP-lab/YingMusic-Singer)
+ [![HuggingFace Model](https://img.shields.io/badge/🤗%20HuggingFace-Model-FF9D00)](https://huggingface.co/ASLP-lab/YingMusic-Singer)
+ [![Dataset LyricEditBench](https://img.shields.io/badge/🤗%20HuggingFace-LyricEditBench-FF6F00)](https://huggingface.co/datasets/ASLP-lab/LyricEditBench)
+ [![Discord](https://img.shields.io/badge/Discord-Join%20Us-5865F2?logo=discord&logoColor=white)](https://discord.gg/RXghgWyvrn)
+ [![WeChat](https://img.shields.io/badge/WeChat-Group-07C160?logo=wechat&logoColor=white)](https://github.com/ASLP-lab/YingMusic-Singer/blob/main/assets/wechat_qr.png)
+ [![Lab](https://img.shields.io/badge/🏫%20ASLP-Lab-4A90D9)](http://www.npu-aslp.org/)
+
+
+ <p>
+ <a href="https://orcid.org/0009-0005-5957-8936"><b>Chunbo Hao</b></a>¹² ·
+ <a href="https://orcid.org/0009-0003-2602-2910"><b>Junjie Zheng</b></a>² ·
+ <a href="https://orcid.org/0009-0001-6706-0572"><b>Guobin Ma</b></a>¹ ·
+ <b>Yuepeng Jiang</b>¹ ·
+ <b>Huakang Chen</b>¹ ·
+ <b>Wenjie Tian</b>¹ ·
+ <a href="https://orcid.org/0009-0003-9258-4006"><b>Gongyu Chen</b></a>² ·
+ <a href="https://orcid.org/0009-0005-5413-6725"><b>Zihao Chen</b></a>² ·
+ <b>Lei Xie</b>¹
+ </p>
+
+ <p>
+ <sup>1</sup> Northwestern Polytechnical University · <sup>2</sup> Giant Network
+ </p>
+
+ </div>
+
+ <div align="center">
+ <img src="./assets/YingMusic-Singer.drawio.svg" alt="YingMusic-Singer Architecture" width="90%">
+ <p><i>Overall architecture of YingMusic-Singer. Left: SFT training pipeline. Right: GRPO training pipeline.</i></p>
+ </div>
+
+
+ ## 📖 Introduction
+
+ **YingMusic-Singer** is a fully diffusion-based singing voice synthesis model that enables **melody-controllable singing voice editing with flexible lyric manipulation**, requiring no manual alignment or precise phoneme annotation.
+
+ Given only three inputs — an optional timbre reference, a melody-providing singing clip, and modified lyrics — YingMusic-Singer synthesizes high-fidelity singing voices at **44.1 kHz** while faithfully preserving the original melody.
+
+
+ ## ✨ Key Features
+
+ - **Annotation-free**: No manual lyric-MIDI alignment required at inference
+ - **Flexible lyric manipulation**: Supports 6 editing types — partial/full changes, insertion, deletion, translation (CN↔EN), and code-switching
+ - **Strong melody preservation**: CKA-based melody alignment loss + GRPO-based optimization
+ - **Bilingual**: Unified IPA tokenizer for both Chinese and English
+ - **High fidelity**: 44.1 kHz stereo output via Stable Audio 2 VAE
+
+
+ ## 🚀 Quick Start
+
+ ### Option 1: Install from Scratch

  ```bash
  conda create -n YingMusic-Singer python=3.10
  conda activate YingMusic-Singer

+ # uv is much faster than pip
  pip install uv
  uv pip install -r requirements.txt
  ```

+ ### Option 2: Pre-built Conda Environment

+ 1. Download and install **Miniconda** from https://repo.anaconda.com/miniconda/ for your platform. Verify with `conda --version`.
+ 2. Download the pre-built environment package for your setup from the table below.
+ 3. In your Conda directory, navigate to `envs/` and create a folder named `YingMusic-Singer`.
+ 4. Move the downloaded package into that folder, then extract it with `tar -xvf <package_name>` (see the sketch below the table).

+ | CPU Architecture | GPU | OS | Download |
+ |------------------|--------|---------|----------|
+ | ARM | NVIDIA | Linux | Coming soon |
+ | AMD64 | NVIDIA | Linux | Coming soon |
+ | AMD64 | NVIDIA | Windows | Coming soon |
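+
+ A minimal sketch of steps 3 and 4 on Linux (the archive name below is a placeholder until the download links above go live, and `~/miniconda3` is an assumed install path):
+
+ ```bash
+ # Placeholder archive name and assumed Conda path; adjust both to your setup
+ mkdir -p ~/miniconda3/envs/YingMusic-Singer
+ mv YingMusic-Singer-amd64-nvidia-linux.tar ~/miniconda3/envs/YingMusic-Singer/
+ cd ~/miniconda3/envs/YingMusic-Singer
+ tar -xvf YingMusic-Singer-amd64-nvidia-linux.tar
+ conda activate YingMusic-Singer
+ ```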
+ ### Option 3: Docker
+
+ Build the image:
+
+ ```bash
  docker build -t yingmusic-singer .
+ ```
+
+ Run inference:
+
+ ```bash
+ docker run --gpus all -it yingmusic-singer
+ ```
+
+
+ ## 🎵 Inference
+
+ ### Option 1: Online Demo (HuggingFace Space)
+
+ Visit https://huggingface.co/spaces/ASLP-lab/YingMusic-Singer to try the model instantly in your browser.
+
+ ### Option 2: Local Gradio App (same as the online demo)
+
+ ```bash
+ python app_local.py
+ ```
+
+ ### Option 3: Command-line Inference
+
+ ```bash
+ python infer_api.py \
+ --ref_audio path/to/ref.wav \
+ --melody_audio path/to/melody.wav \
+ --ref_text "该体谅的不执着|如果那天我" \
+ --target_text "好多天|看不完你" \
+ --output output.wav
+ ```
+
+ Enable vocal separation and accompaniment mixing:
+
+ ```bash
+ # --separate_vocals: separate vocals from the inputs before processing
+ # --mix_accompaniment: mix the synthesized vocal back with the accompaniment
+ python infer_api.py \
+ --ref_audio ref.wav \
+ --melody_audio melody.wav \
+ --ref_text "..." \
+ --target_text "..." \
+ --separate_vocals \
+ --mix_accompaniment \
+ --output mixed_output.wav
+ ```
+ ### Option 4: Batch Inference
156
+
157
+ > **Note**: All audio fed to the model must be pure vocal tracks (no accompaniment). If your inputs contain accompaniment, run vocal separation first using `src/third_party/MusicSourceSeparationTraining/inference_api.py`.
158
+
159
+ The input JSONL file should contain one JSON object per line, formatted as follows:
160
+
161
+ ```json
162
+ {"id": "1", "melody_ref_path": "XXX", "gen_text": "好多天|看不完你", "timbre_ref_path": "XXX", "timbre_ref_text": "该体谅的不执着|如果那天我"}
163
+ ```
164
+
165
+ ```bash
166
  python batch_infer.py \
167
  --input_type jsonl \
168
  --input_path /path/to/input.jsonl \
169
  --output_dir /path/to/output \
170
  --ckpt_path /path/to/ckpts \
171
  --num_gpus 4
172
+ ```
173
 
174
+ Multi-process inference on **LyricEditBench (melody control)** — the test set will be downloaded automatically:
175
+
176
+ ```bash
177
  python inference_mp.py \
178
  --input_type lyric_edit_bench_melody_control \
179
+ --output_dir path/to/LyricEditBench_melody_control \
 
180
  --ckpt_path ASLP-lab/YingMusic-Singer \
181
  --num_gpus 8
182
+ ```
183
 
184
+ Multi-process inference on **LyricEditBench (singing edit)**:
185
+
186
+ ```bash
187
  python inference_mp.py \
188
  --input_type lyric_edit_bench_sing_edit \
189
+ --output_dir path/to/LyricEditBench_sing_edit \
 
190
  --ckpt_path ASLP-lab/YingMusic-Singer \
191
  --num_gpus 8
192
+ ```
193
+
+ ## 🏗️ Model Architecture
+
+ YingMusic-Singer consists of four core components:
+
+ | Component | Description |
+ |-----------|-------------|
+ | **VAE** | Stable Audio 2 encoder/decoder; downsamples stereo 44.1 kHz audio by 2048× |
+ | **Melody Extractor** | Encoder of a pretrained MIDI extraction model (SOME); captures disentangled melody information |
+ | **IPA Tokenizer** | Converts Chinese & English lyrics into a unified phoneme sequence with sentence-level alignment |
+ | **DiT-based CFM** | Conditional flow matching backbone following F5-TTS (22 layers, 16 heads, hidden dim 1024) |
+
+ **Total parameters**: ~727.3M (453.6M CFM + 156.1M VAE + 117.6M Melody Extractor)
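+
+ (Derived figure: at a 2048× downsampling factor, 44.1 kHz audio corresponds to roughly 44100 / 2048 ≈ 21.5 latent frames per second.)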
+
+
+ ## 📊 LyricEditBench
+
+ We introduce **LyricEditBench**, the first benchmark for evaluating melody-preserving lyric modification, built on [GTSinger](https://github.com/GTSinger/GTSinger). The dataset is available on HuggingFace at https://huggingface.co/datasets/ASLP-lab/LyricEditBench.
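+
+ A minimal sketch for loading the benchmark, assuming the standard 🤗 `datasets` API and default configuration (split names are not verified against the dataset card):
+
+ ```python
+ from datasets import load_dataset
+
+ # Dataset ID from the link above; check the dataset card for actual splits/fields
+ bench = load_dataset("ASLP-lab/LyricEditBench")
+ print(bench)
+ ```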
+
+ ### Results
+
+ <div align="center">
+ <p><i>Comparison with baseline models on LyricEditBench across task types (Table 1) and languages. Metrics — P: PER, S: SIM, F: F0-CORR, V: VS — are detailed in Section 3. Best results in <b>bold</b>.</i></p>
+ <img src="./assets/results.png" alt="LyricEditBench Results" width="90%">
+ </div>
+
+
+ ## 🙏 Acknowledgements
+
+ This work builds upon the following open-source projects:

+ - [F5-TTS](https://github.com/SWivid/F5-TTS) — DiT-based CFM backbone
+ - [Stable Audio 2](https://github.com/Stability-AI/stable-audio-tools) — VAE architecture
+ - [SOME](https://github.com/openvpi/SOME) — Melody Extractor
+ - [DiffRhythm](https://github.com/ASLP-lab/DiffRhythm) — Sentence-level alignment strategy
+ - [GTSinger](https://github.com/GTSinger/GTSinger) — Benchmark base corpus
+ - [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) — TTS pretraining data


+ ## 📄 License

+ The code and model weights in this project are licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/), **except** for the following:

+ The VAE model weights and inference code (in `src/YingMusic-Singer/utils/stable-audio-tools`) are derived from [Stable Audio Open](https://huggingface.co/stabilityai/stable-audio-open-1.0) by Stability AI, and are licensed under the [Stability AI Community License](./LICENSE-STABILITY).
app_local.py ADDED
@@ -0,0 +1,640 @@
+ """
+ YingMusic Singer - Gradio Web Interface
+ ========================================
+ 基于参考音色与旋律音频的歌声合成系统,支持自动分离人声与伴奏。
+ A singing voice synthesis system powered by YingMusicSinger,
+ with built-in vocal/accompaniment separation via MelBandRoformer.
+ """
+
+ import os
+ import random
+ import tempfile
+
+ import gradio as gr
+ import torch
+ import torchaudio
+
+ from initialization import download_files
+
+ IS_HF_SPACE = os.environ.get("SPACE_ID") is not None
+ HF_ENABLE = False
+ LOCAL_DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+
+
+ def local_move2gpu(x):
+     """Move models to GPU on local environment. No-op on HuggingFace Spaces (ZeroGPU handles it)."""
+     if IS_HF_SPACE:
+         return x
+     return x.to(LOCAL_DEVICE)
+
+
+ # ---------------------------------------------------------------------------
+ # Model loading (lazy, singleton) / 模型懒加载(单例)
+ # ---------------------------------------------------------------------------
+ _model = None
+ _separator = None
+
+
+ def _load_model_impl():
+     """Internal: load YingMusicSinger (no GPU decorator, called inside GPU context)."""
+     download_files(task="infer")
+     global _model
+     if _model is None:
+         from src.YingMusicSinger.infer.YingMusicSinger import YingMusicSinger
+         _model = YingMusicSinger.from_pretrained("ASLP-lab/YingMusic-Singer")
+         _model = local_move2gpu(_model)
+         _model.eval()
+     return _model
+
+
+ def _load_separator_impl():
+     """Internal: load MelBandRoformer separator (no GPU decorator, called inside GPU context)."""
+     download_files(task="infer")
+     global _separator
+     if _separator is None:
+         from src.third_party.MusicSourceSeparationTraining.inference_api import Separator
+         _separator = Separator(
+             config_path="ckpts/config_vocals_mel_band_roformer_kj.yaml",
+             checkpoint_path="ckpts/MelBandRoformer.ckpt",
+         )
+     return _separator
+
+
+ # ---------------------------------------------------------------------------
+ # Vocal separation utilities / 人声分离工具
+ # ---------------------------------------------------------------------------
+ def _separate_vocals_impl(audio_path: str) -> tuple:
+     """
+     Separate audio into vocals and accompaniment using MelBandRoformer.
+     Must be called within an active GPU context.
+     """
+     separator = _load_separator_impl()
+
+     wav, sr = torchaudio.load(audio_path)
+     vocal_wav, inst_wav, out_sr = separator.separate(wav, sr)
+
+     tmp_dir = tempfile.mkdtemp()
+     vocals_path = os.path.join(tmp_dir, "vocals.wav")
+     accomp_path = os.path.join(tmp_dir, "accompaniment.wav")
+     torchaudio.save(vocals_path, torch.from_numpy(vocal_wav), out_sr)
+     torchaudio.save(accomp_path, torch.from_numpy(inst_wav), out_sr)
+
+     return vocals_path, accomp_path
+
+
+ def mix_vocal_and_accompaniment(
+     vocal_path: str,
+     accomp_path: str,
+     vocal_gain: float = 1.0,
+ ) -> str:
+     """
+     将合成人声与伴奏混合为最终音频。
+     Mix synthesised vocals with accompaniment into a final audio file.
+     """
+     vocal_wav, vocal_sr = torchaudio.load(vocal_path)
+     accomp_wav, accomp_sr = torchaudio.load(accomp_path)
+
+     # Resample the accompaniment to the vocal sample rate if they differ
+     if accomp_sr != vocal_sr:
+         accomp_wav = torchaudio.functional.resample(accomp_wav, accomp_sr, vocal_sr)
+
+     # Match channel counts (broadcast the mono signal to the other's channels)
+     if vocal_wav.shape[0] != accomp_wav.shape[0]:
+         if vocal_wav.shape[0] == 1:
+             vocal_wav = vocal_wav.expand(accomp_wav.shape[0], -1)
+         else:
+             accomp_wav = accomp_wav.expand(vocal_wav.shape[0], -1)
+
+     # Trim both signals to the common length before summing
+     min_len = min(vocal_wav.shape[1], accomp_wav.shape[1])
+     vocal_wav = vocal_wav[:, :min_len]
+     accomp_wav = accomp_wav[:, :min_len]
+
+     mixed = vocal_wav * vocal_gain + accomp_wav
+     # Peak-normalize only if the mix clips
+     peak = mixed.abs().max()
+     if peak > 1.0:
+         mixed = mixed / peak
+
+     out_path = os.path.join(tempfile.mkdtemp(), "mixed_output.wav")
+     torchaudio.save(out_path, mixed, sample_rate=vocal_sr)
+     return out_path
+
+
+ # ---------------------------------------------------------------------------
+ # Inference wrapper / 推理入口
+ # All heavy work (separation + synthesis) happens within one call, so models
+ # stay resident in GPU memory across steps. (In the hosted Space variant this
+ # whole function would sit inside a single @spaces.GPU scope; the local app
+ # needs no decorator.)
+ # ---------------------------------------------------------------------------
+
+ def synthesize(
+     ref_audio,
+     melody_audio,
+     ref_text,
+     target_text,
+     separate_vocals_flag,
+     mix_accompaniment_flag,
+     sil_len_to_end,
+     t_shift,
+     nfe_step,
+     cfg_strength,
+     seed,
+ ):
+     """
+     主合成流程 / Main synthesis pipeline.
+
+     1. (可选) 用 MelBandRoformer 分离参考音频和旋律音频的人声与伴奏
+        (Optional) Separate vocals/accompaniment of the reference and melody audio with MelBandRoformer
+     2. 送入 YingMusicSinger 合成
+        Synthesize with YingMusicSinger
+     3. (可选) 将合成人声与旋律音频的伴奏混合
+        (Optional) Mix the synthesized vocal with the melody audio's accompaniment
+     """
+     # ---- 输入校验 / Input validation ----------------------------------------
+     if ref_audio is None:
+         raise gr.Error("请上传参考音频 / Please upload Reference Audio")
+     if melody_audio is None:
+         raise gr.Error("请上传旋律音频 / Please upload Melody Audio")
+     if not ref_text.strip():
+         raise gr.Error("请输入参考音频对应的歌词 / Please enter Reference Text")
+     if not target_text.strip():
+         raise gr.Error("请输入目标合成歌词 / Please enter Target Text")
+
+     ref_audio_path = ref_audio if isinstance(ref_audio, str) else ref_audio[0]
+     melody_audio_path = (
+         melody_audio if isinstance(melody_audio, str) else melody_audio[0]
+     )
+
+     actual_seed = int(seed)
+     if actual_seed < 0:
+         actual_seed = random.randint(0, 2**31 - 1)
+
+     # ---- Step 1: 人声分离(合并在同一 GPU 上下文中)/ Vocal separation (same GPU context) ----
+     melody_accomp_path = None
+     actual_ref_path = ref_audio_path
+     actual_melody_path = melody_audio_path
+
+     if separate_vocals_flag:
+         ref_vocals_path, _ = _separate_vocals_impl(ref_audio_path)
+         actual_ref_path = ref_vocals_path
+
+         melody_vocals_path, melody_accomp_path = _separate_vocals_impl(melody_audio_path)
+         actual_melody_path = melody_vocals_path
+
+     # ---- Step 2: 模型推理 / Model inference (same GPU context) ---------------
+     model = _load_model_impl()
+
+     audio_tensor, sr = model(
+         ref_audio_path=actual_ref_path,
+         melody_audio_path=actual_melody_path,
+         ref_text=ref_text.strip(),
+         target_text=target_text.strip(),
+         lrc_align_mode="sentence_level",
+         sil_len_to_end=float(sil_len_to_end),
+         t_shift=float(t_shift),
+         nfe_step=int(nfe_step),
+         cfg_strength=float(cfg_strength),
+         seed=actual_seed,
+     )
+
+     vocal_out_path = os.path.join(tempfile.mkdtemp(), "vocal_output.wav")
+     torchaudio.save(vocal_out_path, audio_tensor.to("cpu"), sample_rate=sr)
+
+     # ---- Step 3: 混合伴奏 / Mix accompaniment (optional) ---------------------
+     if (
+         separate_vocals_flag
+         and mix_accompaniment_flag
+         and melody_accomp_path is not None
+     ):
+         final_path = mix_vocal_and_accompaniment(vocal_out_path, melody_accomp_path)
+         return final_path
+     else:
+         return vocal_out_path
+
+
+ # ---------------------------------------------------------------------------
+ # Example presets / 预设示例
+ # ---------------------------------------------------------------------------
+ EXAMPLES_MELODY_CONTROL = [
+     # [ref_audio, melody_audio, ref_text, target_text, sep, mix, sil, t_shift, nfe, cfg, seed]
+     [
+         "examples/melody_control/ref_01.wav",
+         "examples/melody_control/melody_01.wav",
+         "该体谅的不执着|如果那天我",
+         "好多天|看不完你",
+         True, False, 0.5, 0.5, 32, 3.0, -1,
+     ],
+     [
+         "examples/melody_control/ref_02.wav",
+         "examples/melody_control/melody_02.wav",
+         "月光下的身影|渐渐模糊",
+         "星光照亮前路|指引方向",
+         True, False, 0.5, 0.5, 32, 3.0, -1,
+     ],
+ ]
+
+ EXAMPLES_LYRIC_EDIT = [
+     [
+         "examples/lyric_edit/ref_01.wav",
+         "examples/lyric_edit/melody_01.wav",
+         "该体谅的不执着|如果那天我",
+         "忘不掉的笑容|留在心里面",
+         True, False, 0.5, 0.5, 32, 3.0, -1,
+     ],
+     [
+         "examples/lyric_edit/ref_02.wav",
+         "examples/lyric_edit/melody_02.wav",
+         "夜深了还不睡|想着你的脸",
+         "春风又吹过来|带走我思念",
+         True, False, 0.5, 0.5, 32, 3.0, -1,
+     ],
+ ]
+
+
+ # ---------------------------------------------------------------------------
+ # Custom CSS / 自定义样式
+ # ---------------------------------------------------------------------------
+ CUSTOM_CSS = """
+ @import url('https://fonts.googleapis.com/css2?family=DM+Sans:ital,opsz,wght@0,9..40,300;0,9..40,500;0,9..40,700;1,9..40,400&family=Playfair+Display:wght@600;800&display=swap');
+
+ :root {
+     --primary: #e85d04;
+     --primary-light: #f48c06;
+     --bg-dark: #0d1117;
+     --surface: #161b22;
+     --surface-light: #21262d;
+     --text: #f0f6fc;
+     --text-muted: #8b949e;
+     --accent-glow: rgba(232, 93, 4, 0.15);
+     --border: #30363d;
+ }
+
+ .gradio-container {
+     font-family: 'DM Sans', sans-serif !important;
+     max-width: 1100px !important;
+     margin: auto !important;
+ }
+
+ /* ---------- Badge links: no underline, no gap artifacts ---------- */
+ #app-header .badges a {
+     text-decoration: none !important;
+     display: inline-block;
+     line-height: 0;
+     margin: 3px 2px;
+ }
+ #app-header .badges a img,
+ #app-header .badges > img {
+     display: inline-block;
+     vertical-align: middle;
+     margin: 0;
+ }
+ #app-header .badges {
+     line-height: 1.8;
+ }
+
+ /* ---------- Header / 头部 ---------- */
+ #app-header {
+     text-align: center;
+     padding: 1.8rem 1rem 0.5rem;
+ }
+ #app-header h1 {
+     font-size: 1.45rem !important;
+     font-weight: 700 !important;
+     line-height: 1.4;
+     margin-bottom: 0.6rem !important;
+ }
+ #app-header .badges img {
+     display: inline-block;
+     margin: 3px 2px;
+     vertical-align: middle;
+ }
+ #app-header .authors {
+     color: var(--text-muted);
+     font-size: 0.92rem;
+     margin: 0.5rem 0 0.2rem;
+     line-height: 1.7;
+ }
+ #app-header .affiliations {
+     color: var(--text-muted);
+     font-size: 0.85rem;
+     margin-bottom: 0.5rem;
+ }
+ #app-header .lang-links a {
+     color: var(--primary-light);
+     text-decoration: none;
+     margin: 0 4px;
+     font-size: 0.9rem;
+ }
+ #app-header .lang-links a:hover { text-decoration: underline; }
+
+ /* ---------- Disclaimer ---------- */
+ #disclaimer {
+     border-top: 1px solid var(--border);
+     margin: 24px 0 4px;
+     padding: 14px 4px 4px;
+     font-size: 0.80rem;
+     color: #6e7681;
+     line-height: 1.65;
+     text-align: center;
+ }
+ #disclaimer strong {
+     color: #8b949e;
+     font-weight: 600;
+ }
+
+ /* ---------- Section labels / 分区标题 ---------- */
+ .section-title {
+     font-family: 'DM Sans', sans-serif !important;
+     font-weight: 700 !important;
+     font-size: 1rem !important;
+     letter-spacing: 0.06em;
+     text-transform: uppercase;
+     color: var(--primary-light) !important;
+     border-bottom: 2px solid var(--primary);
+     padding-bottom: 6px;
+     margin-bottom: 12px !important;
+ }
+
+ /* ---------- Example tabs ---------- */
+ .example-tab-label {
+     font-weight: 600 !important;
+     font-size: 0.95rem !important;
+ }
+
+ /* ---------- Run button / 合成按钮 ---------- */
+ #run-btn {
+     background: linear-gradient(135deg, #e85d04, #dc2f02) !important;
+     border: none !important;
+     color: #fff !important;
+     font-weight: 700 !important;
+     font-size: 1.1rem !important;
+     letter-spacing: 0.04em;
+     padding: 12px 0 !important;
+     border-radius: 10px !important;
+     transition: transform 0.15s, box-shadow 0.25s !important;
+     box-shadow: 0 4px 20px rgba(232, 93, 4, 0.35) !important;
+ }
+ #run-btn:hover {
+     transform: translateY(-1px) !important;
+     box-shadow: 0 6px 28px rgba(232, 93, 4, 0.5) !important;
+ }
+
+ /* ---------- Output audio / 输出音频 ---------- */
+ #output-audio {
+     border: 2px solid var(--primary) !important;
+     border-radius: 12px !important;
+     background: var(--accent-glow) !important;
+ }
+ """
+
+ # ---------------------------------------------------------------------------
+ # Header HTML / 头部 HTML
+ # ---------------------------------------------------------------------------
+ HEADER_HTML = """
+ <div id="app-header" align="center">
+   <h1>
+     🎤 YingMusic-Singer: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance
+   </h1>
+
+   <div class="badges" style="margin: 10px 0;">
+     <img src="https://img.shields.io/badge/Python-3.10-3776AB?logo=python&logoColor=white" alt="Python">
+     <img src="https://img.shields.io/badge/License-CC%20BY%204.0-4EAA25" alt="License">
+     <a href="https://arxiv.org/abs/0.0" target="_blank">
+       <img src="https://img.shields.io/badge/arXiv-0.0-b31b1b?logo=arxiv&logoColor=white" alt="arXiv">
+     </a>
+     <a href="https://github.com/ASLP-lab/YingMusic-Singer" target="_blank">
+       <img src="https://img.shields.io/badge/GitHub-YingMusic--Singer-181717?logo=github&logoColor=white" alt="GitHub">
+     </a>
+     <a href="https://huggingface.co/spaces/ASLP-lab/YingMusic-Singer" target="_blank">
+       <img src="https://img.shields.io/badge/🤗%20HuggingFace-Space-FFD21E" alt="HuggingFace Space">
+     </a>
+     <a href="https://huggingface.co/ASLP-lab/YingMusic-Singer" target="_blank">
+       <img src="https://img.shields.io/badge/🤗%20HuggingFace-Model-FF9D00" alt="HuggingFace Model">
+     </a>
+     <a href="https://huggingface.co/datasets/ASLP-lab/LyricEditBench" target="_blank">
+       <img src="https://img.shields.io/badge/🤗%20HuggingFace-LyricEditBench-FF6F00" alt="LyricEditBench">
+     </a>
+     <a href="https://discord.gg/RXghgWyvrn" target="_blank">
+       <img src="https://img.shields.io/badge/Discord-Join%20Us-5865F2?logo=discord&logoColor=white" alt="Discord">
+     </a>
+     <a href="https://github.com/ASLP-lab/YingMusic-Singer/blob/main/assets/wechat_qr.png" target="_blank">
+       <img src="https://img.shields.io/badge/WeChat-Group-07C160?logo=wechat&logoColor=white" alt="WeChat">
+     </a>
+     <a href="http://www.npu-aslp.org/" target="_blank">
+       <img src="https://img.shields.io/badge/🏫%20ASLP-Lab-4A90D9" alt="ASLP Lab">
+     </a>
+   </div>
+
+   <p class="authors">
+     <a href="https://orcid.org/0009-0005-5957-8936" target="_blank"><b>Chunbo Hao</b></a>¹² &nbsp;·&nbsp;
+     <a href="https://orcid.org/0009-0003-2602-2910" target="_blank"><b>Junjie Zheng</b></a>² &nbsp;·&nbsp;
+     <a href="https://orcid.org/0009-0001-6706-0572" target="_blank"><b>Guobin Ma</b></a>¹ &nbsp;·&nbsp;
+     <b>Yuepeng Jiang</b>¹ &nbsp;·&nbsp;
+     <b>Huakang Chen</b>¹ &nbsp;·&nbsp;
+     <b>Wenjie Tian</b>¹ &nbsp;·&nbsp;
+     <a href="https://orcid.org/0009-0003-9258-4006" target="_blank"><b>Gongyu Chen</b></a>² &nbsp;·&nbsp;
+     <a href="https://orcid.org/0009-0005-5413-6725" target="_blank"><b>Zihao Chen</b></a>² &nbsp;·&nbsp;
+     <b>Lei Xie</b>¹
+   </p>
+   <p class="affiliations">
+     <sup>1</sup> Northwestern Polytechnical University &nbsp;·&nbsp; <sup>2</sup> Giant Network
+   </p>
+ </div>
+ """
+
+ DISCLAIMER_HTML = """
+ <div id="disclaimer" style="text-align:center;">
+   <strong>免责声明 / Disclaimer</strong><br>
+   YingMusic-Singer 可用于修改歌词后的歌声合成,支持艺术创作与娱乐应用场景。潜在风险包括未经授权的声音克隆与版权侵权问题。为确保负责任地使用,用户应在使用他人声音前取得授权、公开 AI 的参与情况,并确认音乐内容的原创性。<br>
+   <span style="opacity:0.75;">YingMusic-Singer enables the creation of singing voices with modified lyrics, supporting artistic creation and entertainment. Potential risks include unauthorized voice cloning and copyright infringement. To ensure responsible deployment, users should obtain consent for voice usage, disclose AI involvement, and verify musical originality.</span>
+ </div>
+ """
+
+
+ # ---------------------------------------------------------------------------
+ # Build the Gradio UI / 构建界面
+ # ---------------------------------------------------------------------------
+ def build_ui():
+     with gr.Blocks(
+         css=CUSTOM_CSS, title="YingMusic Singer", theme=gr.themes.Base()
+     ) as demo:
+
+         # ---- Header ----
+         gr.HTML(HEADER_HTML)
+         gr.HTML("<hr style='border-color:#30363d; margin: 8px 0 18px;'>")
+
+         # ================================================================
+         # ROW 1 – 音频输入 + 歌词 / Audio inputs + lyrics
+         # ================================================================
+         with gr.Row(equal_height=True):
+             with gr.Column(scale=1):
+                 gr.Markdown("#### 🎙️ 音频输入 / Audio Inputs", elem_classes="section-title")
+                 ref_audio = gr.Audio(
+                     label="参考音频 / Reference Audio(提供音色 / Provides timbre)",
+                     type="filepath",
+                 )
+                 melody_audio = gr.Audio(
+                     label="旋律音频 / Melody Audio(提供旋律与时长 / Provides melody & duration)",
+                     type="filepath",
+                 )
+             with gr.Column(scale=1):
+                 gr.Markdown("#### ✏️ 歌词输入 / Lyrics", elem_classes="section-title")
+                 ref_text = gr.Textbox(
+                     label="参考音频歌词 / Reference Lyrics",
+                     placeholder="例如 / e.g.:该体谅的不执着|如果那天我",
+                     lines=5,
+                 )
+                 target_text = gr.Textbox(
+                     label="目标合成歌词 / Target Lyrics",
+                     placeholder="例如 / e.g.:好多天|看不完你",
+                     lines=5,
+                 )
+
+         # ================================================================
+         # ROW 2 – 预设示例 / Example Presets ← before vocal separation
+         # ================================================================
+         gr.HTML("<hr style='border-color:#30363d; margin: 16px 0 12px;'>")
+         gr.Markdown("#### 🎵 预设示例 / Example Presets", elem_classes="section-title")
+         gr.Markdown(
+             "<small style='color:#8b949e;'>点击任意行自动填入上方输入区域 / Click any row to auto-fill the inputs above</small>"
+         )
+
+         # Hidden advanced-param components so gr.Examples can reference them
+         # (real sliders rendered inside the accordion below override these values)
+         with gr.Row(visible=False):
+             _sep_flag_ex = gr.Checkbox(value=True)
+             _mix_flag_ex = gr.Checkbox(value=False)
+             _sil_ex = gr.Number(value=0.5)
+             _tshift_ex = gr.Number(value=0.5)
+             _nfe_ex = gr.Number(value=32)
+             _cfg_ex = gr.Number(value=3.0)
+             _seed_ex = gr.Number(value=-1, precision=0)
+
+         _example_inputs = [
+             ref_audio, melody_audio, ref_text, target_text,
+             _sep_flag_ex, _mix_flag_ex,
+             _sil_ex, _tshift_ex, _nfe_ex, _cfg_ex, _seed_ex,
+         ]
+
+         with gr.Tabs():
+             with gr.Tab("🎼 Melody Control"):
+                 gr.Examples(
+                     examples=EXAMPLES_MELODY_CONTROL,
+                     inputs=_example_inputs,
+                     label="Melody Control Examples",
+                     examples_per_page=5,
+                 )
+             with gr.Tab("✏️ Lyric Edit"):
+                 gr.Examples(
+                     examples=EXAMPLES_LYRIC_EDIT,
+                     inputs=_example_inputs,
+                     label="Lyric Edit Examples",
+                     examples_per_page=5,
+                 )
+
+         # ================================================================
+         # ROW 3 – 伴奏分离 / Vocal Separation
+         # ================================================================
+         gr.HTML("<hr style='border-color:#30363d; margin: 16px 0 12px;'>")
+         gr.Markdown("#### 🎚️ 伴奏分离 / Vocal Separation", elem_classes="section-title")
+         gr.HTML("""
+         <div style="font-size:0.85rem; color:#8b949e; line-height:1.75; margin: 0 0 12px; padding: 10px 16px;
+                     background: rgba(255,255,255,0.03); border-radius: 8px; border: 1px solid #21262d;">
+           <ul style="margin:0; padding-left:1.2em; list-style: none;">
+             <li style="margin-bottom:7px;">
+               💡 若输入的<b style="color:#c9d1d9;">参考音频</b>或<b style="color:#c9d1d9;">旋律音频</b>中含有伴奏或背景噪音,请开启「分离人声后过模型」—— 模型基于纯人声训练,混合音频会影响合成质量。<br>
+               <span style="color:#6e7681; font-size:0.82rem;">If either input contains accompaniment or background noise, enable <i>Separate vocals before synthesis</i> — the model is trained on clean vocals only and mixed audio degrades quality.</span>
+             </li>
+             <li style="margin-bottom:7px;">
+               💡 若两个输入均已为干净人声,则无需开启分离,强行开启反而可能因分离模型引入额外的不稳定性。<br>
+               <span style="color:#6e7681; font-size:0.82rem;">If both inputs are already clean vocals, skip separation — enabling it unnecessarily may introduce artifacts from the separation model.</span>
+             </li>
+             <li>
+               💡 若旋律音频含有伴奏,开启「分离人声后过模型」后,最终输出是否保留伴奏由「输出时混入伴奏」控制。<br>
+               <span style="color:#6e7681; font-size:0.82rem;">If the melody audio contains accompaniment and separation is enabled, use <i>Mix accompaniment into output</i> to decide whether to include it in the final result.</span>
+             </li>
+           </ul>
+         </div>
+         """)
+         with gr.Row():
+             separate_vocals_flag = gr.Checkbox(
+                 value=True,
+                 label="分离人声后过模型 / Separate vocals before synthesis",
+                 info="从两个输入音频中分别提取纯人声再送入模型 / Extract clean vocals from both inputs before synthesis",
+             )
+             mix_accompaniment_flag = gr.Checkbox(
+                 value=False,
+                 interactive=True,
+                 label="输出时混入伴奏 / Mix accompaniment into output",
+                 info="将合成人声与分离出的伴奏混合作为最终输出(需先开启人声分离)/ Mix synthesised vocals with the separated accompaniment (requires separation enabled)",
+             )
+
+         with gr.Accordion("⚙️ 高级参数 / Advanced Parameters", open=False):
+             with gr.Row():
+                 nfe_step = gr.Slider(
+                     minimum=4, maximum=128, value=32, step=1,
+                     label="采样步数 / NFE Steps",
+                     info="步数越多质量越高,但速度更慢 / More steps = higher quality, but slower",
+                 )
+                 cfg_strength = gr.Slider(
+                     minimum=0.0, maximum=10.0, value=3.0, step=0.1,
+                     label="引导强度 / CFG Strength",
+                     info="无分类器引导强度 / Classifier-Free Guidance strength",
+                 )
+                 t_shift = gr.Slider(
+                     minimum=0.0, maximum=1.0, value=0.5, step=0.01,
+                     label="采样时间偏移 / t‑shift",
+                 )
+             with gr.Row():
+                 sil_len_to_end = gr.Slider(
+                     minimum=0.0, maximum=3.0, value=0.5, step=0.1,
+                     label="末尾静音时长(秒)/ Silence Padding (s)",
+                     info="在参考音频末尾追加的静音长度 / Silence appended after reference audio",
+                 )
+                 seed = gr.Number(
+                     value=-1, precision=0,
+                     label="随机种子 / Random Seed",
+                     info="-1 表示随机生成 / -1 means random",
+                 )
+
+         # ================================================================
+         # ROW 4 – 合成按钮与输出 / Run & Output
+         # ================================================================
+         gr.HTML("<hr style='border-color:#30363d; margin: 12px 0;'>")
+         run_btn = gr.Button("🎤 开始合成 / Start Synthesizing", elem_id="run-btn", size="lg")
+
+         output_audio = gr.Audio(
+             label="合成结果 / Generated Audio",
+             type="filepath",
+             elem_id="output-audio",
+         )
+
+         # All inputs for the synthesize() call (uses real sliders, not example placeholders)
+         _all_inputs = [
+             ref_audio, melody_audio, ref_text, target_text,
+             separate_vocals_flag, mix_accompaniment_flag,
+             sil_len_to_end, t_shift, nfe_step, cfg_strength, seed,
+         ]
+
+         # ================================================================
+         # Event wiring / 事件绑定
+         # ================================================================
+         separate_vocals_flag.change(
+             fn=lambda sep: gr.update(interactive=sep, value=False),
+             inputs=[separate_vocals_flag],
+             outputs=[mix_accompaniment_flag],
+         )
+
+         run_btn.click(
+             fn=synthesize,
+             inputs=_all_inputs,
+             outputs=output_audio,
+         )
+
+         # ---- 页脚:免责声明 / Footer: disclaimer ----
+         gr.HTML(DISCLAIMER_HTML)
+
+     return demo
+
+
+ # ---------------------------------------------------------------------------
+ # Entry point / 启动入口
+ # ---------------------------------------------------------------------------
+ if __name__ == "__main__":
+     demo = build_ui()
+     demo.queue()
+     demo.launch(server_name="0.0.0.0", server_port=7860, share=False)
assets/YingMusic-Singer.drawio.svg ADDED
Git LFS Details
  • SHA256: e8210989d3cf74dfef055cfc21adc3af3183fcfeb901432a8e0347cf4e94b380
  • Pointer size: 131 Bytes
  • Size of remote file: 445 kB
assets/results.png ADDED
Git LFS Details
  • SHA256: 7510ae52b719a0518d8fc4e1517a2fdc72b5002bb8260bec439a0a052198b4ac
  • Pointer size: 131 Bytes
  • Size of remote file: 256 kB
assets/wechat_qr.png ADDED
Git LFS Details
  • SHA256: e54baa9890f817f1d67e575e407c37f09909a267761a31ee9f9b0d23649a00d3
  • Pointer size: 131 Bytes
  • Size of remote file: 402 kB
infer_api.py ADDED
@@ -0,0 +1,216 @@
1
+ """
2
+ YingMusic Singer - Command Line Inference
3
+ ==========================================
4
+ Single-sample inference script, replacing the Gradio Web UI.
5
+
6
+ Usage:
7
+ python infer.py \
8
+ --ref_audio path/to/ref.wav \
9
+ --melody_audio path/to/melody.wav \
10
+ --ref_text "该体谅的不执着|如果那天我" \
11
+ --target_text "好多天|看不完你" \
12
+ --output output.wav
13
+
14
+ # Enable vocal separation + accompaniment mixing simultaneously
15
+ python infer.py \
16
+ --ref_audio ref.wav \
17
+ --melody_audio melody.wav \
18
+ --ref_text "..." \
19
+ --target_text "..." \
20
+ --separate_vocals \
21
+ --mix_accompaniment \
22
+ --output mixed_output.wav
23
+ """
24
+
25
+ import argparse
26
+ import os
27
+ import random
28
+ import tempfile
29
+
30
+ import torch
31
+ import torchaudio
32
+
33
+ from initialization import download_files
34
+
35
+
36
+ # ---------------------------------------------------------------------------
37
+ # Model loading (lazy singleton)
38
+ # ---------------------------------------------------------------------------
39
+ _model = None
40
+ _separator = None
41
+
42
+
43
+ def get_device():
44
+ return "cuda:0" if torch.cuda.is_available() else "cpu"
45
+
46
+
47
+ def get_model():
48
+ global _model
49
+ if _model is None:
50
+ download_files(task="infer")
51
+ from src.YingMusicSinger.infer.YingMusicSinger import YingMusicSinger
52
+ _model = YingMusicSinger.from_pretrained("ASLP-lab/YingMusic-Singer")
53
+ _model = _model.to(get_device())
54
+ _model.eval()
55
+ return _model
56
+
57
+
58
+ def get_separator():
59
+ global _separator
60
+ if _separator is None:
61
+ download_files(task="infer")
62
+ from src.third_party.MusicSourceSeparationTraining.inference_api import Separator
63
+ _separator = Separator(
64
+ config_path="ckpts/config_vocals_mel_band_roformer_kj.yaml",
65
+ checkpoint_path="ckpts/MelBandRoformer.ckpt",
66
+ )
67
+ return _separator
68
+
69
+
70
+ # ---------------------------------------------------------------------------
71
+ # Vocal separation
72
+ # ---------------------------------------------------------------------------
73
+ def separate_vocals(audio_path: str) -> tuple:
74
+ """
75
+ Separate vocals and accompaniment, returns (vocals_path, accompaniment_path).
76
+ """
77
+ separator = get_separator()
78
+ wav, sr = torchaudio.load(audio_path)
79
+ vocal_wav, inst_wav, out_sr = separator.separate(wav, sr)
80
+
81
+ tmp_dir = tempfile.mkdtemp()
82
+ vocals_path = os.path.join(tmp_dir, "vocals.wav")
83
+ accomp_path = os.path.join(tmp_dir, "accompaniment.wav")
84
+ torchaudio.save(vocals_path, torch.from_numpy(vocal_wav), out_sr)
85
+ torchaudio.save(accomp_path, torch.from_numpy(inst_wav), out_sr)
86
+ return vocals_path, accomp_path
87
+
88
+
89
+ # ---------------------------------------------------------------------------
90
+ # Mix vocals + accompaniment
91
+ # ---------------------------------------------------------------------------
92
+ def mix_vocal_and_accompaniment(vocal_path: str, accomp_path: str, vocal_gain: float = 1.0) -> str:
93
+ vocal_wav, vocal_sr = torchaudio.load(vocal_path)
94
+ accomp_wav, accomp_sr = torchaudio.load(accomp_path)
95
+
96
+ if accomp_sr != vocal_sr:
97
+ accomp_wav = torchaudio.functional.resample(accomp_wav, accomp_sr, vocal_sr)
98
+
99
+ if vocal_wav.shape[0] != accomp_wav.shape[0]:
100
+ if vocal_wav.shape[0] == 1:
101
+ vocal_wav = vocal_wav.expand(accomp_wav.shape[0], -1)
102
+ else:
103
+ accomp_wav = accomp_wav.expand(vocal_wav.shape[0], -1)
104
+
105
+ min_len = min(vocal_wav.shape[1], accomp_wav.shape[1])
106
+ mixed = vocal_wav[:, :min_len] * vocal_gain + accomp_wav[:, :min_len]
107
+
108
+ peak = mixed.abs().max()
109
+ if peak > 1.0:
110
+ mixed = mixed / peak
111
+
112
+ out_path = os.path.join(tempfile.mkdtemp(), "mixed_output.wav")
113
+ torchaudio.save(out_path, mixed, sample_rate=vocal_sr)
114
+ return out_path
115
+
116
+
+ # ---------------------------------------------------------------------------
+ # Main inference pipeline
+ # ---------------------------------------------------------------------------
+ def synthesize(args):
+     actual_seed = args.seed if args.seed >= 0 else random.randint(0, 2**31 - 1)
+     print(f"[INFO] Using seed: {actual_seed}")
+
+     actual_ref_path = args.ref_audio
+     actual_melody_path = args.melody_audio
+     melody_accomp_path = None
+
+     # Step 1: Vocal separation (optional)
+     if args.separate_vocals:
+         print("[INFO] Separating vocals from reference audio...")
+         actual_ref_path, _ = separate_vocals(args.ref_audio)
+
+         print("[INFO] Separating vocals from melody audio...")
+         actual_melody_path, melody_accomp_path = separate_vocals(args.melody_audio)
+
+     # Step 2: Model inference
+     print("[INFO] Loading model...")
+     model = get_model()
+
+     print("[INFO] Running synthesis...")
+     audio_tensor, sr = model(
+         ref_audio_path=actual_ref_path,
+         melody_audio_path=actual_melody_path,
+         ref_text=args.ref_text.strip(),
+         target_text=args.target_text.strip(),
+         lrc_align_mode="sentence_level",
+         sil_len_to_end=args.sil_len_to_end,
+         t_shift=args.t_shift,
+         nfe_step=args.nfe_step,
+         cfg_strength=args.cfg_strength,
+         seed=actual_seed,
+     )
+
+     vocal_out_path = os.path.join(tempfile.mkdtemp(), "vocal_output.wav")
+     torchaudio.save(vocal_out_path, audio_tensor.to("cpu"), sample_rate=sr)
+
+     # Step 3: Mix accompaniment (optional)
+     if args.separate_vocals and args.mix_accompaniment and melody_accomp_path is not None:
+         print("[INFO] Mixing vocals with accompaniment...")
+         final_path = mix_vocal_and_accompaniment(vocal_out_path, melody_accomp_path)
+     else:
+         final_path = vocal_out_path
+
+     # Write to the specified output path
+     out_wav, out_sr = torchaudio.load(final_path)
+     os.makedirs(os.path.dirname(os.path.abspath(args.output)), exist_ok=True)
+     torchaudio.save(args.output, out_wav, sample_rate=out_sr)
+     print(f"[INFO] Saved to: {args.output}")
+
+
+ # ---------------------------------------------------------------------------
+ # Argument parser
+ # ---------------------------------------------------------------------------
+ def parse_args():
+     parser = argparse.ArgumentParser(
+         description="YingMusic Singer - Single sample command line inference"
+     )
+
+     # Required
+     parser.add_argument("--ref_audio", required=True,
+                         help="Reference audio path")
+     parser.add_argument("--melody_audio", required=True,
+                         help="Melody audio path")
+     parser.add_argument("--ref_text", required=True,
+                         help="Reference lyrics, use | to separate phrases")
+     parser.add_argument("--target_text", required=True,
+                         help="Target lyrics, use | to separate phrases")
+
+     # Output
+     parser.add_argument("--output", default="output.wav",
+                         help="Output wav path (default: output.wav)")
+
+     # Optional flags
+     parser.add_argument("--separate_vocals", action="store_true",
+                         help="Separate vocals before synthesis")
+     parser.add_argument("--mix_accompaniment", action="store_true",
+                         help="Mix accompaniment into output (requires --separate_vocals)")
+
+     # Advanced params
+     parser.add_argument("--nfe_step", type=int, default=32,
+                         help="NFE steps (default: 32)")
+     parser.add_argument("--cfg_strength", type=float, default=3.0,
+                         help="CFG strength (default: 3.0)")
+     parser.add_argument("--t_shift", type=float, default=0.5,
+                         help="t-shift (default: 0.5)")
+     parser.add_argument("--sil_len_to_end", type=float, default=0.5,
+                         help="Silence padding in seconds (default: 0.5)")
+     parser.add_argument("--seed", type=int, default=-1,
+                         help="Random seed, -1 for random (default: -1)")
+
+     return parser.parse_args()
+
+
+ if __name__ == "__main__":
+     args = parse_args()
+     synthesize(args)
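+
+
+ # Example invocation (illustrative; assumes this script is saved as inference.py,
+ # which is not confirmed by this diff):
+ #   python inference.py \
+ #       --ref_audio ref.wav \
+ #       --melody_audio melody.wav \
+ #       --ref_text "ref line one|ref line two" \
+ #       --target_text "new line one|new line two" \
+ #       --output out/output.wav \
+ #       --separate_vocals --mix_accompaniment --seed 42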
inference_mp.py ADDED
@@ -0,0 +1,324 @@
+ """
+ YingMusicSinger batch inference script.
+ Supports multi-GPU multiprocessing with per-worker progress bars.
+ Input can be a JSONL file or the LyricEditBench dataset.
+
+ Usage:
+     # JSONL input, 4 GPUs
+     python inference_mp.py \
+         --input_type jsonl \
+         --input_path /path/to/input.jsonl \
+         --output_dir /path/to/output \
+         --ckpt_path /path/to/ckpts \
+         --num_gpus 4
+
+     # LyricEditBench input
+     python inference_mp.py \
+         --input_type lyric_edit_bench_melody_control \
+         --output_dir /path/to/output \
+         --ckpt_path /path/to/ckpts \
+         --num_gpus 4
+ """
+
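+ # Expected JSONL record shape (inferred from the fields consumed in worker()
+ # below; one JSON object per line):
+ # {"id": "sample_001",
+ #  "timbre_ref_path": "/path/to/timbre_ref.wav",
+ #  "timbre_ref_text": "reference lyrics",
+ #  "melody_ref_path": "/path/to/melody_ref.wav",
+ #  "gen_text": "target lyrics"}
+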
+ import argparse
+ import json
+ import os
+ import sys
+ import traceback
+ from pathlib import Path
+
+ import torch
+ import torch.multiprocessing as mp
+ import torchaudio
+ from datasets import Audio, Dataset
+ from huggingface_hub import hf_hub_download
+ from tqdm import tqdm
+
+
+ def load_jsonl(path: str) -> list[dict]:
+     items = []
+     with open(path, "r", encoding="utf-8") as f:
+         for line in f:
+             line = line.strip()
+             if line:
+                 items.append(json.loads(line))
+     return items
+
+
+ def build_dataset_from_local(gtsinger_root: str):
+     """
+     Build the LyricEditBench dataset using your local GTSinger directory.
+
+     Args:
+         gtsinger_root: Root directory of your local GTSinger dataset.
+     """
+     # Download the inherited metadata from HuggingFace
+     json_path = hf_hub_download(
+         repo_id="ASLP-lab/LyricEditBench",
+         filename="GTSinger_Inherited.json",
+         repo_type="dataset",
+     )
+
+     with open(json_path, "r") as f:
+         data = json.load(f)
+
+     gtsinger_root = str(Path(gtsinger_root).resolve())
+
+     # Prepend the local root to the relative paths
+     for item in data:
+         item["melody_ref_path"] = os.path.join(gtsinger_root, item["melody_ref_path"])
+         item["timbre_ref_path"] = os.path.join(gtsinger_root, item["timbre_ref_path"])
+         # Point the audio fields at the resolved file paths
+         item["melody_ref_audio"] = item["melody_ref_path"]
+         item["timbre_ref_audio"] = item["timbre_ref_path"]
+
+     # Build a HuggingFace Dataset with Audio features
+     ds = Dataset.from_list(data)
+     ds = ds.cast_column("melody_ref_audio", Audio())
+     ds = ds.cast_column("timbre_ref_audio", Audio())
+
+     return ds
+
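+ # Note (editor): casting a column to datasets.Audio() means accessing e.g.
+ # row["melody_ref_audio"] decodes to {"array", "sampling_rate", "path"}.
+ # This script only consumes the *_path fields downstream, so decoding stays lazy.
+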
+ def load_subset(data: list, subset_id: str) -> list:
+     """Filter the dataset by a subset ID list."""
+     subset_path = hf_hub_download(
+         repo_id="ASLP-lab/LyricEditBench",
+         filename=f"id_lists/{subset_id}.txt",
+         repo_type="dataset",
+     )
+
+     with open(subset_path, "r") as f:
+         id_set = set(line.strip() for line in f if line.strip())
+
+     return [item for item in data if item["id"] in id_set]
+
+
+ def load_lyric_edit_bench(input_type) -> list[dict]:
+     # If you have GTSinger downloaded locally, use this:
+     ds_full = build_dataset_from_local(
+         "/user-fs/chenzihao/zhengjunjie/datas/Music/openvocaldata/GTSinger"
+     )
+
+     # Otherwise, you can use this:
+     # from datasets import load_dataset
+     # ds_full = load_dataset("ASLP-lab/LyricEditBench", split="test")
+
+     # At this point ds_full is loaded either way
+     subset_1k = load_subset(ds_full, "1K")
+     print(f"Loaded {len(subset_1k)} items")
+
+     items = []
+     for row in subset_1k:
+         if input_type == "lyric_edit_bench_melody_control":
+             items.append(
+                 {
+                     "id": row.get("id", ""),
+                     "melody_ref_path": row.get("melody_ref_path", ""),
+                     "gen_text": row.get("gen_text", ""),
+                     "timbre_ref_path": row.get("timbre_ref_path", ""),
+                     "timbre_ref_text": row.get("timbre_ref_text", ""),
+                 }
+             )
+         elif input_type == "lyric_edit_bench_sing_edit":
+             items.append(
+                 {
+                     "id": row.get("id", ""),
+                     "melody_ref_path": row.get("melody_ref_path", ""),
+                     "gen_text": row.get("gen_text", ""),
+                     # Singing-edit mode reuses the melody reference as the timbre reference
+                     "timbre_ref_path": row.get("melody_ref_path", ""),
+                     "timbre_ref_text": row.get("melody_ref_text", ""),
+                 }
+             )
+         else:
+             raise ValueError(f"Unknown input_type: {input_type}")
+     return items
+
+
+ def worker(
+     rank: int,
+     world_size: int,
+     items: list[dict],
+     output_dir: str,
+     ckpt_path: str,
+     args: argparse.Namespace,
+ ):
+     """Worker process that runs on a single GPU."""
+     device = f"cuda:{rank}"
+     torch.cuda.set_device(rank)
+
+     # ---- Load the model ----
+     from src.YingMusicSinger.infer.YingMusicSinger import YingMusicSinger
+
+     model = YingMusicSinger.from_pretrained(ckpt_path)
+     model.to(device)
+     model.eval()
+
+     # ---- Shard the data: each worker handles its own slice ----
+     shard = items[rank::world_size]
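+     # (Round-robin split: e.g. with world_size=4, rank 1 handles items 1, 5, 9, ...)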
+
+     # ---- Show the progress bar only on rank 0 (unless --show_all_progress) ----
+     pbar = tqdm(
+         shard,
+         desc=f"[GPU {rank}]",
+         position=rank,
+         leave=True,
+         disable=(rank != 0 and not args.show_all_progress),
+     )
+
+     success, fail = 0, 0
+     for item in pbar:
+         item_id = item.get("id", f"unknown_{success + fail}")
+         out_path = os.path.join(output_dir, f"{item_id}.wav")
+
+         # Skip outputs that already exist
+         if os.path.exists(out_path) and not args.overwrite:
+             success += 1
+             pbar.set_postfix(ok=success, err=fail)
+             continue
+
+         try:
+             with torch.no_grad():
+                 audio, sr = model(
+                     ref_audio_path=item["timbre_ref_path"],
+                     melody_audio_path=item["melody_ref_path"],
+                     ref_text=item.get("timbre_ref_text", ""),
+                     target_text=item.get("gen_text", ""),
+                     lrc_align_mode=args.lrc_align_mode,
+                     sil_len_to_end=args.sil_len_to_end,
+                     t_shift=args.t_shift,
+                     nfe_step=args.nfe_step,
+                     cfg_strength=args.cfg_strength,
+                     seed=args.seed
+                     if args.seed != -1
+                     else torch.randint(0, 2**32, (1,)).item(),
+                 )
+
+             # Move the audio to CPU before saving, matching the single-sample script
+             torchaudio.save(out_path, audio.cpu(), sample_rate=sr)
+             success += 1
+
+         except Exception as e:
+             fail += 1
+             print(f"\n[GPU {rank}] ERROR on {item_id}: {e}", file=sys.stderr)
+             if args.verbose:
+                 traceback.print_exc()
+
+         pbar.set_postfix(ok=success, err=fail)
+
+     pbar.close()
+     print(f"[GPU {rank}] Done. success={success}, fail={fail}")
+
+
+ def main():
+     parser = argparse.ArgumentParser(description="YingMusicSinger batch inference")
+
+     # ---- Input ----
+     parser.add_argument(
+         "--input_type",
+         type=str,
+         required=True,
+         choices=[
+             "jsonl",
+             "lyric_edit_bench_melody_control",
+             "lyric_edit_bench_sing_edit",
+         ],
+         help="Input type: jsonl, lyric_edit_bench_melody_control, or lyric_edit_bench_sing_edit",
+     )
+     parser.add_argument(
+         "--input_path",
+         type=str,
+         default=None,
+         help="JSONL file path (required when input_type=jsonl)",
+     )
+
+     # ---- Output ----
+     parser.add_argument(
+         "--output_dir",
+         type=str,
+         required=True,
+         help="Output directory",
+     )
+
+     # ---- Model ----
+     parser.add_argument(
+         "--ckpt_path",
+         type=str,
+         required=False,
+         help="Model checkpoint path (a directory saved with save_pretrained)",
+         default=None,
+     )
+
+     # ---- Inference parameters ----
+     parser.add_argument(
+         "--num_gpus", type=int, default=None, help="Number of GPUs to use (default: all)"
+     )
+     parser.add_argument(
+         "--lrc_align_mode",
+         type=str,
+         default="sentence_level",
+         choices=["sentence_level"],
+     )
+     parser.add_argument("--sil_len_to_end", type=float, default=0.5)
+     parser.add_argument("--t_shift", type=float, default=0.5)
+     parser.add_argument("--nfe_step", type=int, default=32)
+     parser.add_argument("--cfg_strength", type=float, default=3.0)
+     parser.add_argument("--seed", type=int, default=-1)
+
+     # ---- Misc ----
+     parser.add_argument("--overwrite", action="store_true", help="Overwrite existing output files")
+     parser.add_argument(
+         "--show_all_progress", action="store_true", help="Show a progress bar on every GPU"
+     )
+     parser.add_argument("--verbose", action="store_true", help="Print full error tracebacks")
+
+     args = parser.parse_args()
+
+     # ---- Validation ----
+     if args.input_type == "jsonl":
+         assert args.input_path is not None, "--input_path is required when input_type=jsonl"
+         assert os.path.isfile(args.input_path), f"File not found: {args.input_path}"
+
+     # ---- Load data ----
+     print("Loading data...")
+     if args.input_type == "jsonl":
+         items = load_jsonl(args.input_path)
+     else:
+         items = load_lyric_edit_bench(args.input_type)
+     print(f"Loaded {len(items)} items")
+
+     # ---- Determine GPU count ----
+     available_gpus = torch.cuda.device_count()
+     num_gpus = args.num_gpus or available_gpus
+     num_gpus = min(num_gpus, available_gpus, len(items))
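+     # (E.g. requesting 8 GPUs for only 3 items launches just 3 workers.)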
+     assert num_gpus > 0, "No GPU available"
+     print(f"Using {num_gpus} GPU(s)")
+
+     # ---- Create the output directory ----
+     os.makedirs(args.output_dir, exist_ok=True)
+
+     # ---- Launch worker processes ----
+     if num_gpus == 1:
+         # Single GPU: run directly, no spawn needed
+         worker(0, 1, items, args.output_dir, args.ckpt_path, args)
+     else:
+         mp.set_start_method("spawn", force=True)
+         processes = []
+         for rank in range(num_gpus):
+             p = mp.Process(
+                 target=worker,
+                 args=(rank, num_gpus, items, args.output_dir, args.ckpt_path, args),
+             )
+             p.start()
+             processes.append(p)
+
+         for p in processes:
+             p.join()
+
+     print(f"\nInference complete! Output directory: {args.output_dir}")
+
+
+ if __name__ == "__main__":
+     main()
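+
+ # Note (editor): the "spawn" start method is used above because each child
+ # process must initialize its own CUDA context; processes started with "fork"
+ # cannot safely reuse the parent's CUDA state.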
inference_mp.sh ADDED
@@ -0,0 +1,41 @@
+ # JSONL input
+ # python inference_mp.py \
+ #     --input_type jsonl \
+ #     --input_path /user-fs/chenzihao/aslp_music/haochunbo/final/YingMusic-Singer/temp_out/input_jsonl.jsonl \
+ #     --output_dir /user-fs/chenzihao/aslp_music/haochunbo/final/YingMusic-Singer/temp_out/Jsonl \
+ #     --ckpt_path /user-fs/chenzihao/aslp_music/haochunbo/final/YingMusic-Singer-ckpts \
+ #     --num_gpus 8 \
+ #     --show_all_progress
+
+
+ # LyricEditBench input:
+ # python inference_mp.py \
+ #     --input_type lyric_edit_bench_melody_control \
+ #     --output_dir /user-fs/chenzihao/aslp_music/haochunbo/final/YingMusic-Singer/temp_out2/LyricEditBench_melody_control \
+ #     --ckpt_path /user-fs/chenzihao/aslp_music/haochunbo/final/YingMusic-Singer-ckpts \
+ #     --num_gpus 8 \
+ #     --overwrite
+
+
+ # python inference_mp.py \
+ #     --input_type lyric_edit_bench_sing_edit \
+ #     --output_dir /user-fs/chenzihao/aslp_music/haochunbo/final/YingMusic-Singer/temp_out2/LyricEditBench_sing_edit \
+ #     --ckpt_path /user-fs/chenzihao/aslp_music/haochunbo/final/YingMusic-Singer-ckpts \
+ #     --num_gpus 8 \
+ #     --overwrite
+
+
+ python inference_mp.py \
+     --input_type lyric_edit_bench_melody_control \
+     --output_dir /user-fs/chenzihao/aslp_music/haochunbo/final/YingMusic-Singer/temp_out4/LyricEditBench_melody_control \
+     --ckpt_path /user-fs/chenzihao/aslp_music/haochunbo/final/YingMusic-Singer-ckpts-ema \
+     --num_gpus 8 \
+     --overwrite
+
+
+ # python inference_mp.py \
+ #     --input_type lyric_edit_bench_sing_edit \
+ #     --output_dir /user-fs/chenzihao/aslp_music/haochunbo/final/YingMusic-Singer/temp_out3/LyricEditBench_sing_edit \
+ #     --ckpt_path /user-fs/chenzihao/aslp_music/haochunbo/final/YingMusic-Singer-ckpts-ema-fix-extra-sli \
+ #     --num_gpus 8 \
+ #     --overwrite