xjsc0 committed
Commit ffbb4ab · 1 Parent(s): 4566e34
LICENSE ADDED
@@ -0,0 +1,25 @@
+ Creative Commons Attribution 4.0 International Public License
+
+ By exercising the Licensed Rights (defined below), You accept and agree
+ to be bound by the terms and conditions of this Creative Commons
+ Attribution 4.0 International Public License ("Public License").
+ To the extent this Public License may be interpreted as a contract,
+ You are granted the Licensed Rights in consideration of Your acceptance
+ of these terms and conditions, and the Licensor grants You such rights
+ in consideration of benefits the Licensor receives from making
+ the Licensed Material available under these terms and conditions.
+
+ You are free to:
+ - Share — copy and redistribute the material in any medium or format
+ - Adapt — remix, transform, and build upon the material for any purpose, even commercially.
+
+ Under the following terms:
+ - Attribution — You must give appropriate credit, provide a link to the license,
+ and indicate if changes were made. You may do so in any reasonable manner,
+ but not in any way that suggests the licensor endorses you or your use.
+
+ No additional restrictions — You may not apply legal terms or
+ technological measures that legally restrict others from doing
+ anything the license permits.
+
+ Full license text: https://creativecommons.org/licenses/by/4.0/legalcode
LICENSE-STABILITY ADDED
@@ -0,0 +1,57 @@
+ STABILITY AI COMMUNITY LICENSE AGREEMENT
+
+ Last Updated: July 5, 2024
+ 1. INTRODUCTION
+
+ This Agreement applies to any individual person or entity (“You”, “Your” or “Licensee”) that uses or distributes any portion or element of the Stability AI Materials or Derivative Works thereof for any Research & Non-Commercial or Commercial purpose. Capitalized terms not otherwise defined herein are defined in Section V below.
+
+ This Agreement is intended to allow research, non-commercial, and limited commercial uses of the Models free of charge. In order to ensure that certain limited commercial uses of the Models continue to be allowed, this Agreement preserves free access to the Models for people or organizations generating annual revenue of less than US $1,000,000 (or local currency equivalent).
+
+ By clicking “I Accept” or by using or distributing any portion or element of the Stability AI Materials or Derivative Works, You agree that You have read, understood and are bound by the terms of this Agreement. If You are acting on behalf of a company, organization or other entity, then “You” includes you and that entity, and You agree that You: (i) are an authorized representative of such entity with the authority to bind such entity to this Agreement, and (ii) You agree to the terms of this Agreement on that entity’s behalf.
+
+ 2. RESEARCH & NON-COMMERCIAL USE LICENSE
+
+ Subject to the terms of this Agreement, Stability AI grants You a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable and royalty-free limited license under Stability AI’s intellectual property or other rights owned by Stability AI embodied in the Stability AI Materials to use, reproduce, distribute, and create Derivative Works of, and make modifications to, the Stability AI Materials for any Research or Non-Commercial Purpose. “Research Purpose” means academic or scientific advancement, and in each case, is not primarily intended for commercial advantage or monetary compensation to You or others. “Non-Commercial Purpose” means any purpose other than a Research Purpose that is not primarily intended for commercial advantage or monetary compensation to You or others, such as personal use (i.e., hobbyist) or evaluation and testing.
+
+ 3. COMMERCIAL USE LICENSE
+
+ Subject to the terms of this Agreement (including the remainder of this Section III), Stability AI grants You a non-exclusive, worldwide, non-transferable, non-sublicensable, revocable and royalty-free limited license under Stability AI’s intellectual property or other rights owned by Stability AI embodied in the Stability AI Materials to use, reproduce, distribute, and create Derivative Works of, and make modifications to, the Stability AI Materials for any Commercial Purpose. “Commercial Purpose” means any purpose other than a Research Purpose or Non-Commercial Purpose that is primarily intended for commercial advantage or monetary compensation to You or others, including but not limited to, (i) creating, modifying, or distributing Your product or service, including via a hosted service or application programming interface, and (ii) for Your business’s or organization’s internal operations.
+ If You are using or distributing the Stability AI Materials for a Commercial Purpose, You must register with Stability AI at (https://stability.ai/community-license). If at any time You or Your Affiliate(s), either individually or in aggregate, generate more than USD $1,000,000 in annual revenue (or the equivalent thereof in Your local currency), regardless of whether that revenue is generated directly or indirectly from the Stability AI Materials or Derivative Works, any licenses granted to You under this Agreement shall terminate as of such date. You must request a license from Stability AI at (https://stability.ai/enterprise), which Stability AI may grant to You in its sole discretion. If you receive Stability AI Materials, or any Derivative Works thereof, from a Licensee as part of an integrated end user product, then Section III of this Agreement will not apply to you.
+
+ 4. GENERAL TERMS
+
+ Your Research, Non-Commercial, and Commercial License(s) under this Agreement are subject to the following terms.
+ a. Distribution & Attribution. If You distribute or make available the Stability AI Materials or a Derivative Work to a third party, or a product or service that uses any portion of them, You shall: (i) provide a copy of this Agreement to that third party, (ii) retain the following attribution notice within a "Notice" text file distributed as a part of such copies: "This Stability AI Model is licensed under the Stability AI Community License, Copyright © Stability AI Ltd. All Rights Reserved”, and (iii) prominently display “Powered by Stability AI” on a related website, user interface, blogpost, about page, or product documentation. If You create a Derivative Work, You may add your own attribution notice(s) to the “Notice” text file included with that Derivative Work, provided that You clearly indicate which attributions apply to the Stability AI Materials and state in the “Notice” text file that You changed the Stability AI Materials and how it was modified.
+ b. Use Restrictions. Your use of the Stability AI Materials and Derivative Works, including any output or results of the Stability AI Materials or Derivative Works, must comply with applicable laws and regulations (including Trade Control Laws and equivalent regulations) and adhere to the Documentation and Stability AI’s AUP, which is hereby incorporated by reference. Furthermore, You will not use the Stability AI Materials or Derivative Works, or any output or results of the Stability AI Materials or Derivative Works, to create or improve any foundational generative AI model (excluding the Models or Derivative Works).
+ c. Intellectual Property.
+ (i) Trademark License. No trademark licenses are granted under this Agreement, and in connection with the Stability AI Materials or Derivative Works, You may not use any name or mark owned by or associated with Stability AI or any of its Affiliates, except as required under Section IV(a) herein.
+ (ii) Ownership of Derivative Works. As between You and Stability AI, You are the owner of Derivative Works You create, subject to Stability AI’s ownership of the Stability AI Materials and any Derivative Works made by or for Stability AI.
+ (iii) Ownership of Outputs. As between You and Stability AI, You own any outputs generated from the Models or Derivative Works to the extent permitted by applicable law.
+ (iv) Disputes. If You or Your Affiliate(s) institute litigation or other proceedings against Stability AI (including a cross-claim or counterclaim in a lawsuit) alleging that the Stability AI Materials, Derivative Works or associated outputs or results, or any portion of any of the foregoing, constitutes infringement of intellectual property or other rights owned or licensable by You, then any licenses granted to You under this Agreement shall terminate as of the date such litigation or claim is filed or instituted. You will indemnify and hold harmless Stability AI from and against any claim by any third party arising out of or related to Your use or distribution of the Stability AI Materials or Derivative Works in violation of this Agreement.
+ (v) Feedback. From time to time, You may provide Stability AI with verbal and/or written suggestions, comments or other feedback related to Stability AI’s existing or prospective technology, products or services (collectively, “Feedback”). You are not obligated to provide Stability AI with Feedback, but to the extent that You do, You hereby grant Stability AI a perpetual, irrevocable, royalty-free, fully-paid, sub-licensable, transferable, non-exclusive, worldwide right and license to exploit the Feedback in any manner without restriction. Your Feedback is provided “AS IS” and You make no warranties whatsoever about any Feedback.
+ d. Disclaimer Of Warranty. UNLESS REQUIRED BY APPLICABLE LAW, THE STABILITY AI MATERIALS AND ANY OUTPUT AND RESULTS THEREFROM ARE PROVIDED ON AN "AS IS" BASIS, WITHOUT WARRANTIES OF ANY KIND, EITHER EXPRESS OR IMPLIED, INCLUDING, WITHOUT LIMITATION, ANY WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY, OR FITNESS FOR A PARTICULAR PURPOSE. YOU ARE SOLELY RESPONSIBLE FOR DETERMINING THE APPROPRIATENESS OR LAWFULNESS OF USING OR REDISTRIBUTING THE STABILITY AI MATERIALS, DERIVATIVE WORKS OR ANY OUTPUT OR RESULTS AND ASSUME ANY RISKS ASSOCIATED WITH YOUR USE OF THE STABILITY AI MATERIALS, DERIVATIVE WORKS AND ANY OUTPUT AND RESULTS.
+ e. Limitation Of Liability. IN NO EVENT WILL STABILITY AI OR ITS AFFILIATES BE LIABLE UNDER ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, TORT, NEGLIGENCE, PRODUCTS LIABILITY, OR OTHERWISE, ARISING OUT OF THIS AGREEMENT, FOR ANY LOST PROFITS OR ANY DIRECT, INDIRECT, SPECIAL, CONSEQUENTIAL, INCIDENTAL, EXEMPLARY OR PUNITIVE DAMAGES, EVEN IF STABILITY AI OR ITS AFFILIATES HAVE BEEN ADVISED OF THE POSSIBILITY OF ANY OF THE FOREGOING.
+ f. Term And Termination. The term of this Agreement will commence upon Your acceptance of this Agreement or access to the Stability AI Materials and will continue in full force and effect until terminated in accordance with the terms and conditions herein. Stability AI may terminate this Agreement if You are in breach of any term or condition of this Agreement. Upon termination of this Agreement, You shall delete and cease use of any Stability AI Materials or Derivative Works. Section IV(d), (e), and (g) shall survive the termination of this Agreement.
+ g. Governing Law. This Agreement will be governed by and construed in accordance with the laws of the United States and the State of California without regard to choice of law principles, and the UN Convention on Contracts for International Sale of Goods does not apply to this Agreement.
+
+ 5. DEFINITIONS
+
+ “Affiliate(s)” means any entity that directly or indirectly controls, is controlled by, or is under common control with the subject entity; for purposes of this definition, “control” means direct or indirect ownership or control of more than 50% of the voting interests of the subject entity.
+
+ "Agreement" means this Stability AI Community License Agreement.
+
+ “AUP” means the Stability AI Acceptable Use Policy available at (https://stability.ai/use-policy), as may be updated from time to time.
+
+ "Derivative Work(s)” means (a) any derivative work of the Stability AI Materials as recognized by U.S. copyright laws and (b) any modifications to a Model, and any other model created which is based on or derived from the Model or the Model’s output, including “fine tune” and “low-rank adaptation” models derived from a Model or a Model’s output, but do not include the output of any Model.
+
+ “Documentation” means any specifications, manuals, documentation, and other written information provided by Stability AI related to the Software or Models.
+
+ “Model(s)" means, collectively, Stability AI’s proprietary models and algorithms, including machine-learning models, trained model weights and other elements of the foregoing listed on Stability’s Core Models Webpage available at (https://stability.ai/core-models), as may be updated from time to time.
+
+ "Stability AI" or "we" means Stability AI Ltd. and its Affiliates.
+
+ "Software" means Stability AI’s proprietary software made available under this Agreement now or in the future.
+
+ “Stability AI Materials” means, collectively, Stability’s proprietary Models, Software and Documentation (and any portion or combination thereof) made available under this Agreement.
+
+ “Trade Control Laws” means any applicable U.S. and non-U.S. export control and trade sanctions laws and regulations.
README.md CHANGED
@@ -15,75 +15,222 @@ short_description: Edit lyrics, keep the melody
  fullWidth: true
  ---

- # YingMusic-Singer
- YingMusic-Singer: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance

- ## Environment Setup

- ### 1. Install from Scratch
  ```bash
  conda create -n YingMusic-Singer python=3.10
  conda activate YingMusic-Singer

- # uv is much quicker
  pip install uv
  uv pip install -r requirements.txt
  ```

- ### 2. Pre-built Conda Environment for One-Click Deployment (Nvidia / AMD CPU Only)

- Coming soon

- ## Inference

- ### Using the HuggingFace Space (online demo)

- Visit https://huggingface.co/spaces/ASLP-lab/YingMusic-Singer for a quick hands-on experience
-
- ### Run with Docker

  docker build -t yingmusic-singer .

- ### Run with Python

- git clone
- cd
- python initialization.py --task infer

- # for Gradio

- python app.py

- # Multi-process inference
- # 1. Make sure every audio fed to the model is a separated, vocals-only track; if it has not been separated yet, you can use /src/third_party/MusicSourceSeparationTraining/inference_api.py to separate it
- # 2. The jsonl file format is one JSON object per line, {}
  python batch_infer.py \
  --input_type jsonl \
  --input_path /path/to/input.jsonl \
  --output_dir /path/to/output \
  --ckpt_path /path/to/ckpts \
  --num_gpus 4

- # Multi-process inference (LyricEditBench melody control)
  python inference_mp.py \
  --input_type lyric_edit_bench_melody_control \
- --output_dir path/to/ \
- LyricEditBench_melody_control \
  --ckpt_path ASLP-lab/YingMusic-Singer \
  --num_gpus 8

- # Multi-process inference (LyricEditBench sing edit)
  python inference_mp.py \
  --input_type lyric_edit_bench_sing_edit \
- --output_dir path/to/ \
- LyricEditBench_melody_control \
  --ckpt_path ASLP-lab/YingMusic-Singer \
  --num_gpus 8

- ## License

- The code and model weights in this project are licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/), except for the following components:

- The VAE model weights and inference code (in `src/YingMusic-Singer/utils/stable-audio-tools`) are derived from [Stable Audio Open](https://huggingface.co/stabilityai/stable-audio-open-1.0) by Stability AI, and are licensed under the [Stability AI Community License](./LICENSE-STABILITY).
  fullWidth: true
  ---

+ <div align="center">
+
+ <h1>🎤 YingMusic-Singer: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance</h1>
+
+ <p>
+ <a href="">English</a> | <a href="README_ZH.md">中文</a>
+ </p>
+
+
+ ![Python](https://img.shields.io/badge/Python-3.10-3776AB?logo=python&logoColor=white)
+ ![License](https://img.shields.io/badge/License-CC--BY--4.0-lightgrey)
+ [![arXiv Paper](https://img.shields.io/badge/arXiv-0.0-b31b1b?logo=arxiv&logoColor=white)](https://arxiv.org/abs/0.0)
+ [![GitHub](https://img.shields.io/badge/GitHub-YingMusic--Singer-181717?logo=github&logoColor=white)](https://github.com/ASLP-lab/YingMusic-Singer)
+ [![HuggingFace Space](https://img.shields.io/badge/🤗%20HuggingFace-Space-FFD21E)](https://huggingface.co/spaces/ASLP-lab/YingMusic-Singer)
+ [![HuggingFace Model](https://img.shields.io/badge/🤗%20HuggingFace-Model-FF9D00)](https://huggingface.co/ASLP-lab/YingMusic-Singer)
+ [![Dataset LyricEditBench](https://img.shields.io/badge/🤗%20HuggingFace-LyricEditBench-FF6F00)](https://huggingface.co/datasets/ASLP-lab/LyricEditBench)
+ [![Discord](https://img.shields.io/badge/Discord-Join%20Us-5865F2?logo=discord&logoColor=white)](https://discord.gg/RXghgWyvrn)
+ [![WeChat](https://img.shields.io/badge/WeChat-Group-07C160?logo=wechat&logoColor=white)](https://github.com/ASLP-lab/YingMusic-Singer/blob/main/assets/wechat_qr.png)
+ [![Lab](https://img.shields.io/badge/🏫%20ASLP-Lab-4A90D9)](http://www.npu-aslp.org/)
+
+
+ <p>
+ <a href="https://orcid.org/0009-0005-5957-8936"><b>Chunbo Hao</b></a>¹² ·
+ <a href="https://orcid.org/0009-0003-2602-2910"><b>Junjie Zheng</b></a>² ·
+ <a href="https://orcid.org/0009-0001-6706-0572"><b>Guobin Ma</b></a>¹ ·
+ <b>Yuepeng Jiang</b>¹ ·
+ <b>Huakang Chen</b>¹ ·
+ <b>Wenjie Tian</b>¹ ·
+ <a href="https://orcid.org/0009-0003-9258-4006"><b>Gongyu Chen</b></a>² ·
+ <a href="https://orcid.org/0009-0005-5413-6725"><b>Zihao Chen</b></a>² ·
+ <b>Lei Xie</b>¹
+ </p>
+
+ <p>
+ <sup>1</sup> Northwestern Polytechnical University · <sup>2</sup> Giant Network
+ </p>
+
+ </div>
+
+ <div align="center">
+ <img src="./assets/YingMusic-Singer.drawio.svg" alt="YingMusic-Singer Architecture" width="90%">
+ <p><i>Overall architecture of YingMusic-Singer. Left: SFT training pipeline. Right: GRPO training pipeline.</i></p>
+ </div>
+
+
+ ## 📖 Introduction
+
+ **YingMusic-Singer** is a fully diffusion-based singing voice synthesis model that enables **melody-controllable singing voice editing with flexible lyric manipulation**, requiring no manual alignment or precise phoneme annotation.
+
+ Given only three inputs — an optional timbre reference, a melody-providing singing clip, and modified lyrics — YingMusic-Singer synthesizes high-fidelity singing voices at **44.1 kHz** while faithfully preserving the original melody.
+
+
+ ## ✨ Key Features
+
+ - **Annotation-free**: No manual lyric-MIDI alignment required at inference
+ - **Flexible lyric manipulation**: Supports 6 editing types — partial/full changes, insertion, deletion, translation (CN↔EN), and code-switching
+ - **Strong melody preservation**: CKA-based melody alignment loss + GRPO-based optimization
+ - **Bilingual**: Unified IPA tokenizer for both Chinese and English
+ - **High fidelity**: 44.1 kHz stereo output via Stable Audio 2 VAE
+
+
+ ## 🚀 Quick Start
+
+ ### Option 1: Install from Scratch

  ```bash
  conda create -n YingMusic-Singer python=3.10
  conda activate YingMusic-Singer

+ # uv is much faster than pip
  pip install uv
  uv pip install -r requirements.txt
  ```

+ ### Option 2: Pre-built Conda Environment

+ 1. Download and install **Miniconda** from https://repo.anaconda.com/miniconda/ for your platform. Verify with `conda --version`.
+ 2. Download the pre-built environment package for your setup from the table below.
+ 3. In your Conda directory, navigate to `envs/` and create a folder named `YingMusic-Singer`.
+ 4. Move the downloaded package into that folder, then extract it with `tar -xvf <package_name>` (see the sketch below the table).

+ | CPU Architecture | GPU | OS | Download |
+ |------------------|--------|---------|----------|
+ | ARM | NVIDIA | Linux | Coming soon |
+ | AMD64 | NVIDIA | Linux | Coming soon |
+ | AMD64 | NVIDIA | Windows | Coming soon |
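+
+ A minimal sketch of steps 3 and 4 on Linux (the archive name below is a placeholder until the download links above go live, and `~/miniconda3` is an assumed install path):
+
+ ```bash
+ # Placeholder archive name and assumed Conda path; adjust both to your setup
+ mkdir -p ~/miniconda3/envs/YingMusic-Singer
+ mv YingMusic-Singer-amd64-nvidia-linux.tar ~/miniconda3/envs/YingMusic-Singer/
+ cd ~/miniconda3/envs/YingMusic-Singer
+ tar -xvf YingMusic-Singer-amd64-nvidia-linux.tar
+ conda activate YingMusic-Singer
+ ```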
+ ### Option 3: Docker
+
+ Build the image:
+
+ ```bash
  docker build -t yingmusic-singer .
+ ```
+
+ Run inference:
+
+ ```bash
+ docker run --gpus all -it yingmusic-singer
+ ```
+
+
+ ## 🎵 Inference
+
+ ### Option 1: Online Demo (HuggingFace Space)
+
+ Visit https://huggingface.co/spaces/ASLP-lab/YingMusic-Singer to try the model instantly in your browser.
+
+ ### Option 2: Local Gradio App (same as the online demo)
+
+ ```bash
+ python app_local.py
+ ```
+
+ ### Option 3: Command-line Inference
+
+ ```bash
+ python infer_api.py \
+ --ref_audio path/to/ref.wav \
+ --melody_audio path/to/melody.wav \
+ --ref_text "该体谅的不执着|如果那天我" \
+ --target_text "好多天|看不完你" \
+ --output output.wav
+ ```
+
+ Enable vocal separation and accompaniment mixing:
+
+ ```bash
+ # --separate_vocals: separate vocals from the inputs before processing
+ # --mix_accompaniment: mix the synthesized vocal back with the accompaniment
+ python infer_api.py \
+ --ref_audio ref.wav \
+ --melody_audio melody.wav \
+ --ref_text "..." \
+ --target_text "..." \
+ --separate_vocals \
+ --mix_accompaniment \
+ --output mixed_output.wav
+ ```
+ ### Option 4: Batch Inference
156
+
157
+ > **Note**: All audio fed to the model must be pure vocal tracks (no accompaniment). If your inputs contain accompaniment, run vocal separation first using `src/third_party/MusicSourceSeparationTraining/inference_api.py`.
158
+
159
+ The input JSONL file should contain one JSON object per line, formatted as follows:
160
+
161
+ ```json
162
+ {"id": "1", "melody_ref_path": "XXX", "gen_text": "好多天|看不完你", "timbre_ref_path": "XXX", "timbre_ref_text": "该体谅的不执着|如果那天我"}
163
+ ```
164
+
165
+ ```bash
166
  python batch_infer.py \
167
  --input_type jsonl \
168
  --input_path /path/to/input.jsonl \
169
  --output_dir /path/to/output \
170
  --ckpt_path /path/to/ckpts \
171
  --num_gpus 4
172
+ ```
173
 
174
+ Multi-process inference on **LyricEditBench (melody control)** — the test set will be downloaded automatically:
175
+
176
+ ```bash
177
  python inference_mp.py \
178
  --input_type lyric_edit_bench_melody_control \
179
+ --output_dir path/to/LyricEditBench_melody_control \
 
180
  --ckpt_path ASLP-lab/YingMusic-Singer \
181
  --num_gpus 8
182
+ ```
183
 
184
+ Multi-process inference on **LyricEditBench (singing edit)**:
185
+
186
+ ```bash
187
  python inference_mp.py \
188
  --input_type lyric_edit_bench_sing_edit \
189
+ --output_dir path/to/LyricEditBench_sing_edit \
 
190
  --ckpt_path ASLP-lab/YingMusic-Singer \
191
  --num_gpus 8
192
+ ```
193
+
+ ## 🏗️ Model Architecture
+
+ YingMusic-Singer consists of four core components:
+
+ | Component | Description |
+ |-----------|-------------|
+ | **VAE** | Stable Audio 2 encoder/decoder; downsamples stereo 44.1 kHz audio by 2048× |
+ | **Melody Extractor** | Encoder of a pretrained MIDI extraction model (SOME); captures disentangled melody information |
+ | **IPA Tokenizer** | Converts Chinese & English lyrics into a unified phoneme sequence with sentence-level alignment |
+ | **DiT-based CFM** | Conditional flow matching backbone following F5-TTS (22 layers, 16 heads, hidden dim 1024) |
+
+ **Total parameters**: ~727.3M (453.6M CFM + 156.1M VAE + 117.6M Melody Extractor)
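+
+ (Derived figure: at a 2048× downsampling factor, 44.1 kHz audio corresponds to roughly 44100 / 2048 ≈ 21.5 latent frames per second.)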
+
+
+ ## 📊 LyricEditBench
+
+ We introduce **LyricEditBench**, the first benchmark for evaluating melody-preserving lyric modification, built on [GTSinger](https://github.com/GTSinger/GTSinger). The dataset is available on HuggingFace at https://huggingface.co/datasets/ASLP-lab/LyricEditBench.
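+
+ A minimal sketch for loading the benchmark, assuming the standard 🤗 `datasets` API and default configuration (split names are not verified against the dataset card):
+
+ ```python
+ from datasets import load_dataset
+
+ # Dataset ID from the link above; check the dataset card for actual splits/fields
+ bench = load_dataset("ASLP-lab/LyricEditBench")
+ print(bench)
+ ```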
+
+ ### Results
+
+ <div align="center">
+ <p><i>Comparison with baseline models on LyricEditBench across task types (Table 1) and languages. Metrics — P: PER, S: SIM, F: F0-CORR, V: VS — are detailed in Section 3. Best results in <b>bold</b>.</i></p>
+ <img src="./assets/results.png" alt="LyricEditBench Results" width="90%">
+ </div>
+
+
+ ## 🙏 Acknowledgements
+
+ This work builds upon the following open-source projects:

+ - [F5-TTS](https://github.com/SWivid/F5-TTS) — DiT-based CFM backbone
+ - [Stable Audio 2](https://github.com/Stability-AI/stable-audio-tools) — VAE architecture
+ - [SOME](https://github.com/openvpi/SOME) — Melody Extractor
+ - [DiffRhythm](https://github.com/ASLP-lab/DiffRhythm) — Sentence-level alignment strategy
+ - [GTSinger](https://github.com/GTSinger/GTSinger) — Benchmark base corpus
+ - [Emilia](https://huggingface.co/datasets/amphion/Emilia-Dataset) — TTS pretraining data


+ ## 📄 License

+ The code and model weights in this project are licensed under [CC BY 4.0](https://creativecommons.org/licenses/by/4.0/), **except** for the following:

+ The VAE model weights and inference code (in `src/YingMusic-Singer/utils/stable-audio-tools`) are derived from [Stable Audio Open](https://huggingface.co/stabilityai/stable-audio-open-1.0) by Stability AI, and are licensed under the [Stability AI Community License](./LICENSE-STABILITY).
app_local.py ADDED
@@ -0,0 +1,640 @@
+ """
+ YingMusic Singer - Gradio Web Interface
+ ========================================
+ 基于参考音色与旋律音频的歌声合成系统,支持自动分离人声与伴奏。
+ A singing voice synthesis system powered by YingMusicSinger,
+ with built-in vocal/accompaniment separation via MelBandRoformer.
+ """
+
+ import os
+ import random
+ import tempfile
+
+ import gradio as gr
+ import torch
+ import torchaudio
+
+ from initialization import download_files
+
+ IS_HF_SPACE = os.environ.get("SPACE_ID") is not None
+ HF_ENABLE = False
+ LOCAL_DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
+
+
+ def local_move2gpu(x):
+     """Move models to GPU on local environment. No-op on HuggingFace Spaces (ZeroGPU handles it)."""
+     if IS_HF_SPACE:
+         return x
+     return x.to(LOCAL_DEVICE)
+
+
+ # ---------------------------------------------------------------------------
+ # Model loading (lazy, singleton) / 模型懒加载(单例)
+ # ---------------------------------------------------------------------------
+ _model = None
+ _separator = None
+
+
+ def _load_model_impl():
+     """Internal: load YingMusicSinger (no GPU decorator, called inside GPU context)."""
+     download_files(task="infer")
+     global _model
+     if _model is None:
+         from src.YingMusicSinger.infer.YingMusicSinger import YingMusicSinger
+         _model = YingMusicSinger.from_pretrained("ASLP-lab/YingMusic-Singer")
+         _model = local_move2gpu(_model)
+         _model.eval()
+     return _model
+
+
+ def _load_separator_impl():
+     """Internal: load MelBandRoformer separator (no GPU decorator, called inside GPU context)."""
+     download_files(task="infer")
+     global _separator
+     if _separator is None:
+         from src.third_party.MusicSourceSeparationTraining.inference_api import Separator
+         _separator = Separator(
+             config_path="ckpts/config_vocals_mel_band_roformer_kj.yaml",
+             checkpoint_path="ckpts/MelBandRoformer.ckpt",
+         )
+     return _separator
+
+
+ # ---------------------------------------------------------------------------
+ # Vocal separation utilities / 人声分离工具
+ # ---------------------------------------------------------------------------
+ def _separate_vocals_impl(audio_path: str) -> tuple:
+     """
+     Separate audio into vocals and accompaniment using MelBandRoformer.
+     Must be called within an active GPU context.
+     """
+     separator = _load_separator_impl()
+
+     wav, sr = torchaudio.load(audio_path)
+     vocal_wav, inst_wav, out_sr = separator.separate(wav, sr)
+
+     tmp_dir = tempfile.mkdtemp()
+     vocals_path = os.path.join(tmp_dir, "vocals.wav")
+     accomp_path = os.path.join(tmp_dir, "accompaniment.wav")
+     torchaudio.save(vocals_path, torch.from_numpy(vocal_wav), out_sr)
+     torchaudio.save(accomp_path, torch.from_numpy(inst_wav), out_sr)
+
+     return vocals_path, accomp_path
+
+
+ def mix_vocal_and_accompaniment(
+     vocal_path: str,
+     accomp_path: str,
+     vocal_gain: float = 1.0,
+ ) -> str:
+     """
+     将合成人声与伴奏混合为最终音频。
+     Mix synthesised vocals with accompaniment into a final audio file.
+     """
+     vocal_wav, vocal_sr = torchaudio.load(vocal_path)
+     accomp_wav, accomp_sr = torchaudio.load(accomp_path)
+
+     # Resample the accompaniment to the vocal sample rate if they differ
+     if accomp_sr != vocal_sr:
+         accomp_wav = torchaudio.functional.resample(accomp_wav, accomp_sr, vocal_sr)
+
+     # Match channel counts (broadcast the mono signal to the other's channels)
+     if vocal_wav.shape[0] != accomp_wav.shape[0]:
+         if vocal_wav.shape[0] == 1:
+             vocal_wav = vocal_wav.expand(accomp_wav.shape[0], -1)
+         else:
+             accomp_wav = accomp_wav.expand(vocal_wav.shape[0], -1)
+
+     # Trim both signals to the common length before summing
+     min_len = min(vocal_wav.shape[1], accomp_wav.shape[1])
+     vocal_wav = vocal_wav[:, :min_len]
+     accomp_wav = accomp_wav[:, :min_len]
+
+     mixed = vocal_wav * vocal_gain + accomp_wav
+     # Peak-normalize only if the mix clips
+     peak = mixed.abs().max()
+     if peak > 1.0:
+         mixed = mixed / peak
+
+     out_path = os.path.join(tempfile.mkdtemp(), "mixed_output.wav")
+     torchaudio.save(out_path, mixed, sample_rate=vocal_sr)
+     return out_path
+
+
+ # ---------------------------------------------------------------------------
+ # Inference wrapper / 推理入口
+ # All heavy work (separation + synthesis) happens within one call, so models
+ # stay resident in GPU memory across steps. (In the hosted Space variant this
+ # whole function would sit inside a single @spaces.GPU scope; the local app
+ # needs no decorator.)
+ # ---------------------------------------------------------------------------
+
+ def synthesize(
+     ref_audio,
+     melody_audio,
+     ref_text,
+     target_text,
+     separate_vocals_flag,
+     mix_accompaniment_flag,
+     sil_len_to_end,
+     t_shift,
+     nfe_step,
+     cfg_strength,
+     seed,
+ ):
+     """
+     主合成流程 / Main synthesis pipeline.
+
+     1. (可选) 用 MelBandRoformer 分离参考音频和旋律音频的人声与伴奏
+        (Optional) Separate vocals/accompaniment of the reference and melody audio with MelBandRoformer
+     2. 送入 YingMusicSinger 合成
+        Synthesize with YingMusicSinger
+     3. (可选) 将合成人声与旋律音频的伴奏混合
+        (Optional) Mix the synthesized vocal with the melody audio's accompaniment
+     """
+     # ---- 输入校验 / Input validation ----------------------------------------
+     if ref_audio is None:
+         raise gr.Error("请上传参考音频 / Please upload Reference Audio")
+     if melody_audio is None:
+         raise gr.Error("请上传旋律音频 / Please upload Melody Audio")
+     if not ref_text.strip():
+         raise gr.Error("请输入参考音频对应的歌词 / Please enter Reference Text")
+     if not target_text.strip():
+         raise gr.Error("请输入目标合成歌词 / Please enter Target Text")
+
+     ref_audio_path = ref_audio if isinstance(ref_audio, str) else ref_audio[0]
+     melody_audio_path = (
+         melody_audio if isinstance(melody_audio, str) else melody_audio[0]
+     )
+
+     actual_seed = int(seed)
+     if actual_seed < 0:
+         actual_seed = random.randint(0, 2**31 - 1)
+
+     # ---- Step 1: 人声分离(合并在同一 GPU 上下文中)/ Vocal separation (same GPU context) ----
+     melody_accomp_path = None
+     actual_ref_path = ref_audio_path
+     actual_melody_path = melody_audio_path
+
+     if separate_vocals_flag:
+         ref_vocals_path, _ = _separate_vocals_impl(ref_audio_path)
+         actual_ref_path = ref_vocals_path
+
+         melody_vocals_path, melody_accomp_path = _separate_vocals_impl(melody_audio_path)
+         actual_melody_path = melody_vocals_path
+
+     # ---- Step 2: 模型推理 / Model inference (same GPU context) ---------------
+     model = _load_model_impl()
+
+     audio_tensor, sr = model(
+         ref_audio_path=actual_ref_path,
+         melody_audio_path=actual_melody_path,
+         ref_text=ref_text.strip(),
+         target_text=target_text.strip(),
+         lrc_align_mode="sentence_level",
+         sil_len_to_end=float(sil_len_to_end),
+         t_shift=float(t_shift),
+         nfe_step=int(nfe_step),
+         cfg_strength=float(cfg_strength),
+         seed=actual_seed,
+     )
+
+     vocal_out_path = os.path.join(tempfile.mkdtemp(), "vocal_output.wav")
+     torchaudio.save(vocal_out_path, audio_tensor.to("cpu"), sample_rate=sr)
+
+     # ---- Step 3: 混合伴奏 / Mix accompaniment (optional) ---------------------
+     if (
+         separate_vocals_flag
+         and mix_accompaniment_flag
+         and melody_accomp_path is not None
+     ):
+         final_path = mix_vocal_and_accompaniment(vocal_out_path, melody_accomp_path)
+         return final_path
+     else:
+         return vocal_out_path
+
+
+ # ---------------------------------------------------------------------------
+ # Example presets / 预设示例
+ # ---------------------------------------------------------------------------
+ EXAMPLES_MELODY_CONTROL = [
+     # [ref_audio, melody_audio, ref_text, target_text, sep, mix, sil, t_shift, nfe, cfg, seed]
+     [
+         "examples/melody_control/ref_01.wav",
+         "examples/melody_control/melody_01.wav",
+         "该体谅的不执着|如果那天我",
+         "好多天|看不完你",
+         True, False, 0.5, 0.5, 32, 3.0, -1,
+     ],
+     [
+         "examples/melody_control/ref_02.wav",
+         "examples/melody_control/melody_02.wav",
+         "月光下的身影|渐渐模糊",
+         "星光照亮前路|指引方向",
+         True, False, 0.5, 0.5, 32, 3.0, -1,
+     ],
+ ]
+
+ EXAMPLES_LYRIC_EDIT = [
+     [
+         "examples/lyric_edit/ref_01.wav",
+         "examples/lyric_edit/melody_01.wav",
+         "该体谅的不执着|如果那天我",
+         "忘不掉的笑容|留在心里面",
+         True, False, 0.5, 0.5, 32, 3.0, -1,
+     ],
+     [
+         "examples/lyric_edit/ref_02.wav",
+         "examples/lyric_edit/melody_02.wav",
+         "夜深了还不睡|想着你的脸",
+         "春风又吹过来|带走我思念",
+         True, False, 0.5, 0.5, 32, 3.0, -1,
+     ],
+ ]
+
+
+ # ---------------------------------------------------------------------------
+ # Custom CSS / 自定义样式
+ # ---------------------------------------------------------------------------
+ CUSTOM_CSS = """
+ @import url('https://fonts.googleapis.com/css2?family=DM+Sans:ital,opsz,wght@0,9..40,300;0,9..40,500;0,9..40,700;1,9..40,400&family=Playfair+Display:wght@600;800&display=swap');
+
+ :root {
+     --primary: #e85d04;
+     --primary-light: #f48c06;
+     --bg-dark: #0d1117;
+     --surface: #161b22;
+     --surface-light: #21262d;
+     --text: #f0f6fc;
+     --text-muted: #8b949e;
+     --accent-glow: rgba(232, 93, 4, 0.15);
+     --border: #30363d;
+ }
+
+ .gradio-container {
+     font-family: 'DM Sans', sans-serif !important;
+     max-width: 1100px !important;
+     margin: auto !important;
+ }
+
+ /* ---------- Badge links: no underline, no gap artifacts ---------- */
+ #app-header .badges a {
+     text-decoration: none !important;
+     display: inline-block;
+     line-height: 0;
+     margin: 3px 2px;
+ }
+ #app-header .badges a img,
+ #app-header .badges > img {
+     display: inline-block;
+     vertical-align: middle;
+     margin: 0;
+ }
+ #app-header .badges {
+     line-height: 1.8;
+ }
+
+ /* ---------- Header / 头部 ---------- */
+ #app-header {
+     text-align: center;
+     padding: 1.8rem 1rem 0.5rem;
+ }
+ #app-header h1 {
+     font-size: 1.45rem !important;
+     font-weight: 700 !important;
+     line-height: 1.4;
+     margin-bottom: 0.6rem !important;
+ }
+ #app-header .badges img {
+     display: inline-block;
+     margin: 3px 2px;
+     vertical-align: middle;
+ }
+ #app-header .authors {
+     color: var(--text-muted);
+     font-size: 0.92rem;
+     margin: 0.5rem 0 0.2rem;
+     line-height: 1.7;
+ }
+ #app-header .affiliations {
+     color: var(--text-muted);
+     font-size: 0.85rem;
+     margin-bottom: 0.5rem;
+ }
+ #app-header .lang-links a {
+     color: var(--primary-light);
+     text-decoration: none;
+     margin: 0 4px;
+     font-size: 0.9rem;
+ }
+ #app-header .lang-links a:hover { text-decoration: underline; }
+
+ /* ---------- Disclaimer ---------- */
+ #disclaimer {
+     border-top: 1px solid var(--border);
+     margin: 24px 0 4px;
+     padding: 14px 4px 4px;
+     font-size: 0.80rem;
+     color: #6e7681;
+     line-height: 1.65;
+     text-align: center;
+ }
+ #disclaimer strong {
+     color: #8b949e;
+     font-weight: 600;
+ }
+
+ /* ---------- Section labels / 分区标题 ---------- */
+ .section-title {
+     font-family: 'DM Sans', sans-serif !important;
+     font-weight: 700 !important;
+     font-size: 1rem !important;
+     letter-spacing: 0.06em;
+     text-transform: uppercase;
+     color: var(--primary-light) !important;
+     border-bottom: 2px solid var(--primary);
+     padding-bottom: 6px;
+     margin-bottom: 12px !important;
+ }
+
+ /* ---------- Example tabs ---------- */
+ .example-tab-label {
+     font-weight: 600 !important;
+     font-size: 0.95rem !important;
+ }
+
+ /* ---------- Run button / 合成按钮 ---------- */
+ #run-btn {
+     background: linear-gradient(135deg, #e85d04, #dc2f02) !important;
+     border: none !important;
+     color: #fff !important;
+     font-weight: 700 !important;
+     font-size: 1.1rem !important;
+     letter-spacing: 0.04em;
+     padding: 12px 0 !important;
+     border-radius: 10px !important;
+     transition: transform 0.15s, box-shadow 0.25s !important;
+     box-shadow: 0 4px 20px rgba(232, 93, 4, 0.35) !important;
+ }
+ #run-btn:hover {
+     transform: translateY(-1px) !important;
+     box-shadow: 0 6px 28px rgba(232, 93, 4, 0.5) !important;
+ }
+
+ /* ---------- Output audio / 输出音频 ---------- */
+ #output-audio {
+     border: 2px solid var(--primary) !important;
+     border-radius: 12px !important;
+     background: var(--accent-glow) !important;
+ }
+ """
+
+ # ---------------------------------------------------------------------------
+ # Header HTML / 头部 HTML
+ # ---------------------------------------------------------------------------
+ HEADER_HTML = """
+ <div id="app-header" align="center">
+   <h1>
+     🎤 YingMusic-Singer: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance
+   </h1>
+
+   <div class="badges" style="margin: 10px 0;">
+     <img src="https://img.shields.io/badge/Python-3.10-3776AB?logo=python&logoColor=white" alt="Python">
+     <img src="https://img.shields.io/badge/License-CC%20BY%204.0-4EAA25" alt="License">
+     <a href="https://arxiv.org/abs/0.0" target="_blank">
+       <img src="https://img.shields.io/badge/arXiv-0.0-b31b1b?logo=arxiv&logoColor=white" alt="arXiv">
+     </a>
+     <a href="https://github.com/ASLP-lab/YingMusic-Singer" target="_blank">
+       <img src="https://img.shields.io/badge/GitHub-YingMusic--Singer-181717?logo=github&logoColor=white" alt="GitHub">
+     </a>
+     <a href="https://huggingface.co/spaces/ASLP-lab/YingMusic-Singer" target="_blank">
+       <img src="https://img.shields.io/badge/🤗%20HuggingFace-Space-FFD21E" alt="HuggingFace Space">
+     </a>
+     <a href="https://huggingface.co/ASLP-lab/YingMusic-Singer" target="_blank">
+       <img src="https://img.shields.io/badge/🤗%20HuggingFace-Model-FF9D00" alt="HuggingFace Model">
+     </a>
+     <a href="https://huggingface.co/datasets/ASLP-lab/LyricEditBench" target="_blank">
+       <img src="https://img.shields.io/badge/🤗%20HuggingFace-LyricEditBench-FF6F00" alt="LyricEditBench">
+     </a>
+     <a href="https://discord.gg/RXghgWyvrn" target="_blank">
+       <img src="https://img.shields.io/badge/Discord-Join%20Us-5865F2?logo=discord&logoColor=white" alt="Discord">
+     </a>
+     <a href="https://github.com/ASLP-lab/YingMusic-Singer/blob/main/assets/wechat_qr.png" target="_blank">
+       <img src="https://img.shields.io/badge/WeChat-Group-07C160?logo=wechat&logoColor=white" alt="WeChat">
+     </a>
+     <a href="http://www.npu-aslp.org/" target="_blank">
+       <img src="https://img.shields.io/badge/🏫%20ASLP-Lab-4A90D9" alt="ASLP Lab">
+     </a>
+   </div>
+
+   <p class="authors">
+     <a href="https://orcid.org/0009-0005-5957-8936" target="_blank"><b>Chunbo Hao</b></a>¹² &nbsp;·&nbsp;
+     <a href="https://orcid.org/0009-0003-2602-2910" target="_blank"><b>Junjie Zheng</b></a>² &nbsp;·&nbsp;
+     <a href="https://orcid.org/0009-0001-6706-0572" target="_blank"><b>Guobin Ma</b></a>¹ &nbsp;·&nbsp;
+     <b>Yuepeng Jiang</b>¹ &nbsp;·&nbsp;
+     <b>Huakang Chen</b>¹ &nbsp;·&nbsp;
+     <b>Wenjie Tian</b>¹ &nbsp;·&nbsp;
+     <a href="https://orcid.org/0009-0003-9258-4006" target="_blank"><b>Gongyu Chen</b></a>² &nbsp;·&nbsp;
+     <a href="https://orcid.org/0009-0005-5413-6725" target="_blank"><b>Zihao Chen</b></a>² &nbsp;·&nbsp;
+     <b>Lei Xie</b>¹
+   </p>
+   <p class="affiliations">
+     <sup>1</sup> Northwestern Polytechnical University &nbsp;·&nbsp; <sup>2</sup> Giant Network
+   </p>
+ </div>
+ """
+
+ DISCLAIMER_HTML = """
+ <div id="disclaimer" style="text-align:center;">
+   <strong>免责声明 / Disclaimer</strong><br>
+   YingMusic-Singer 可用于修改歌词后的歌声合成,支持艺术创作与娱乐应用场景。潜在风险包括未经授权的声音克隆与版权侵权问题。为确保负责任地使用,用户应在使用他人声音前取得授权、公开 AI 的参与情况,并确认音乐内容的原创性。<br>
+   <span style="opacity:0.75;">YingMusic-Singer enables the creation of singing voices with modified lyrics, supporting artistic creation and entertainment. Potential risks include unauthorized voice cloning and copyright infringement. To ensure responsible deployment, users should obtain consent for voice usage, disclose AI involvement, and verify musical originality.</span>
+ </div>
+ """
+
+
+ # ---------------------------------------------------------------------------
+ # Build the Gradio UI / 构建界面
+ # ---------------------------------------------------------------------------
+ def build_ui():
+     with gr.Blocks(
+         css=CUSTOM_CSS, title="YingMusic Singer", theme=gr.themes.Base()
+     ) as demo:
+
+         # ---- Header ----
+         gr.HTML(HEADER_HTML)
+         gr.HTML("<hr style='border-color:#30363d; margin: 8px 0 18px;'>")
+
+         # ================================================================
+         # ROW 1 – 音频输入 + 歌词 / Audio inputs + lyrics
+         # ================================================================
+         with gr.Row(equal_height=True):
+             with gr.Column(scale=1):
+                 gr.Markdown("#### 🎙️ 音频输入 / Audio Inputs", elem_classes="section-title")
+                 ref_audio = gr.Audio(
+                     label="参考音频 / Reference Audio(提供音色 / Provides timbre)",
+                     type="filepath",
+                 )
+                 melody_audio = gr.Audio(
+                     label="旋律音频 / Melody Audio(提供旋律与时长 / Provides melody & duration)",
+                     type="filepath",
+                 )
+             with gr.Column(scale=1):
+                 gr.Markdown("#### ✏️ 歌词输入 / Lyrics", elem_classes="section-title")
+                 ref_text = gr.Textbox(
+                     label="参考音频歌词 / Reference Lyrics",
+                     placeholder="例如 / e.g.:该体谅的不执着|如果那天我",
+                     lines=5,
+                 )
+                 target_text = gr.Textbox(
+                     label="目标合成歌词 / Target Lyrics",
+                     placeholder="例如 / e.g.:好多天|看不完你",
+                     lines=5,
+                 )
+
+         # ================================================================
+         # ROW 2 – 预设示例 / Example Presets ← before vocal separation
+         # ================================================================
+         gr.HTML("<hr style='border-color:#30363d; margin: 16px 0 12px;'>")
+         gr.Markdown("#### 🎵 预设示例 / Example Presets", elem_classes="section-title")
+         gr.Markdown(
+             "<small style='color:#8b949e;'>点击任意行自动填入上方输入区域 / Click any row to auto-fill the inputs above</small>"
+         )
+
+         # Hidden advanced-param components so gr.Examples can reference them
+         # (real sliders rendered inside the accordion below override these values)
+         with gr.Row(visible=False):
+             _sep_flag_ex = gr.Checkbox(value=True)
+             _mix_flag_ex = gr.Checkbox(value=False)
+             _sil_ex = gr.Number(value=0.5)
+             _tshift_ex = gr.Number(value=0.5)
+             _nfe_ex = gr.Number(value=32)
+             _cfg_ex = gr.Number(value=3.0)
+             _seed_ex = gr.Number(value=-1, precision=0)
+
+         _example_inputs = [
+             ref_audio, melody_audio, ref_text, target_text,
+             _sep_flag_ex, _mix_flag_ex,
+             _sil_ex, _tshift_ex, _nfe_ex, _cfg_ex, _seed_ex,
+         ]
+
+         with gr.Tabs():
+             with gr.Tab("🎼 Melody Control"):
+                 gr.Examples(
+                     examples=EXAMPLES_MELODY_CONTROL,
+                     inputs=_example_inputs,
+                     label="Melody Control Examples",
+                     examples_per_page=5,
+                 )
+             with gr.Tab("✏️ Lyric Edit"):
+                 gr.Examples(
+                     examples=EXAMPLES_LYRIC_EDIT,
+                     inputs=_example_inputs,
+                     label="Lyric Edit Examples",
+                     examples_per_page=5,
+                 )
+
+         # ================================================================
+         # ROW 3 – 伴奏分离 / Vocal Separation
+         # ================================================================
+         gr.HTML("<hr style='border-color:#30363d; margin: 16px 0 12px;'>")
+         gr.Markdown("#### 🎚️ 伴奏分离 / Vocal Separation", elem_classes="section-title")
+         gr.HTML("""
+         <div style="font-size:0.85rem; color:#8b949e; line-height:1.75; margin: 0 0 12px; padding: 10px 16px;
+                     background: rgba(255,255,255,0.03); border-radius: 8px; border: 1px solid #21262d;">
+           <ul style="margin:0; padding-left:1.2em; list-style: none;">
+             <li style="margin-bottom:7px;">
+               💡 若输入的<b style="color:#c9d1d9;">参考音频</b>或<b style="color:#c9d1d9;">旋律音频</b>中含有伴奏或背景噪音,请开启「分离人声后过模型」—— 模型基于纯人声训练,混合音频会影响合成质量。<br>
+               <span style="color:#6e7681; font-size:0.82rem;">If either input contains accompaniment or background noise, enable <i>Separate vocals before synthesis</i> — the model is trained on clean vocals only and mixed audio degrades quality.</span>
+             </li>
+             <li style="margin-bottom:7px;">
+               💡 若两个输入均已为干净人声,则无需开启分离,强行开启反而可能因分离模型引入额外的不稳定性。<br>
+               <span style="color:#6e7681; font-size:0.82rem;">If both inputs are already clean vocals, skip separation — enabling it unnecessarily may introduce artifacts from the separation model.</span>
+             </li>
+             <li>
+               💡 若旋律音频含有伴奏,开启「分离人声后过模型」后,最终输出是否保留伴奏由「输出时混入伴奏」控制。<br>
+               <span style="color:#6e7681; font-size:0.82rem;">If the melody audio contains accompaniment and separation is enabled, use <i>Mix accompaniment into output</i> to decide whether to include it in the final result.</span>
+             </li>
+           </ul>
+         </div>
+         """)
+         with gr.Row():
+             separate_vocals_flag = gr.Checkbox(
+                 value=True,
+                 label="分离人声后过模型 / Separate vocals before synthesis",
+                 info="从两个输入音频中分别提取纯人声再送入模型 / Extract clean vocals from both inputs before synthesis",
+             )
+             mix_accompaniment_flag = gr.Checkbox(
+                 value=False,
+                 interactive=True,
+                 label="输出时混入伴奏 / Mix accompaniment into output",
+                 info="将合成人声与分离出的伴奏混合作为最终输出(需先开启人声分离)/ Mix synthesised vocals with the separated accompaniment (requires separation enabled)",
+             )
+
+         with gr.Accordion("⚙️ 高级参数 / Advanced Parameters", open=False):
+             with gr.Row():
+                 nfe_step = gr.Slider(
+                     minimum=4, maximum=128, value=32, step=1,
+                     label="采样步数 / NFE Steps",
+                     info="步数越多质量越高,但速度更慢 / More steps = higher quality, but slower",
+                 )
+                 cfg_strength = gr.Slider(
+                     minimum=0.0, maximum=10.0, value=3.0, step=0.1,
+                     label="引导强度 / CFG Strength",
+                     info="无分类器引导强度 / Classifier-Free Guidance strength",
+                 )
+                 t_shift = gr.Slider(
+                     minimum=0.0, maximum=1.0, value=0.5, step=0.01,
+                     label="采样时间偏移 / t‑shift",
+                 )
+             with gr.Row():
+                 sil_len_to_end = gr.Slider(
+                     minimum=0.0, maximum=3.0, value=0.5, step=0.1,
+                     label="末尾静音时长(秒)/ Silence Padding (s)",
+                     info="在参考音频末尾追加的静音长度 / Silence appended after reference audio",
+                 )
+                 seed = gr.Number(
+                     value=-1, precision=0,
+                     label="随机种子 / Random Seed",
+                     info="-1 表示随机生成 / -1 means random",
+                 )
+
+         # ================================================================
+         # ROW 4 – 合成按钮与输出 / Run & Output
+         # ================================================================
+         gr.HTML("<hr style='border-color:#30363d; margin: 12px 0;'>")
+         run_btn = gr.Button("🎤 开始合成 / Start Synthesizing", elem_id="run-btn", size="lg")
+
+         output_audio = gr.Audio(
+             label="合成结果 / Generated Audio",
+             type="filepath",
+             elem_id="output-audio",
+         )
+
+         # All inputs for the synthesize() call (uses real sliders, not example placeholders)
+         _all_inputs = [
+             ref_audio, melody_audio, ref_text, target_text,
+             separate_vocals_flag, mix_accompaniment_flag,
+             sil_len_to_end, t_shift, nfe_step, cfg_strength, seed,
+         ]
+
+         # ================================================================
+         # Event wiring / 事件绑定
+         # ================================================================
+         separate_vocals_flag.change(
+             fn=lambda sep: gr.update(interactive=sep, value=False),
+             inputs=[separate_vocals_flag],
+             outputs=[mix_accompaniment_flag],
+         )
+
+         run_btn.click(
+             fn=synthesize,
+             inputs=_all_inputs,
+             outputs=output_audio,
+         )
+
+         # ---- 页脚:免责声明 / Footer: disclaimer ----
+         gr.HTML(DISCLAIMER_HTML)
+
+     return demo
+
+
+ # ---------------------------------------------------------------------------
+ # Entry point / 启动入口
+ # ---------------------------------------------------------------------------
+ if __name__ == "__main__":
+     demo = build_ui()
+     demo.queue()
+     demo.launch(server_name="0.0.0.0", server_port=7860, share=False)
assets/YingMusic-Singer.drawio.svg ADDED
Git LFS Details
  • SHA256: e8210989d3cf74dfef055cfc21adc3af3183fcfeb901432a8e0347cf4e94b380
  • Pointer size: 131 Bytes
  • Size of remote file: 445 kB
assets/results.png ADDED
Git LFS Details
  • SHA256: 7510ae52b719a0518d8fc4e1517a2fdc72b5002bb8260bec439a0a052198b4ac
  • Pointer size: 131 Bytes
  • Size of remote file: 256 kB
assets/wechat_qr.png ADDED
Git LFS Details
  • SHA256: e54baa9890f817f1d67e575e407c37f09909a267761a31ee9f9b0d23649a00d3
  • Pointer size: 131 Bytes
  • Size of remote file: 402 kB
infer_api.py ADDED
@@ -0,0 +1,216 @@
1
+ """
2
+ YingMusic Singer - Command Line Inference
3
+ ==========================================
4
+ Single-sample inference script, replacing the Gradio Web UI.
5
+
6
+ Usage:
7
+ python infer.py \
8
+ --ref_audio path/to/ref.wav \
9
+ --melody_audio path/to/melody.wav \
10
+ --ref_text "该体谅的不执着|如果那天我" \
11
+ --target_text "好多天|看不完你" \
12
+ --output output.wav
13
+
14
+ # Enable vocal separation + accompaniment mixing simultaneously
15
+ python infer.py \
16
+ --ref_audio ref.wav \
17
+ --melody_audio melody.wav \
18
+ --ref_text "..." \
19
+ --target_text "..." \
20
+ --separate_vocals \
21
+ --mix_accompaniment \
22
+ --output mixed_output.wav
23
+ """
24
+
25
+ import argparse
26
+ import os
27
+ import random
28
+ import tempfile
29
+
30
+ import torch
31
+ import torchaudio
32
+
33
+ from initialization import download_files
34
+
35
+
36
+ # ---------------------------------------------------------------------------
37
+ # Model loading (lazy singleton)
38
+ # ---------------------------------------------------------------------------
39
+ _model = None
40
+ _separator = None
41
+
42
+
43
+ def get_device():
44
+ return "cuda:0" if torch.cuda.is_available() else "cpu"
45
+
46
+
47
+ def get_model():
48
+ global _model
49
+ if _model is None:
50
+ download_files(task="infer")
51
+ from src.YingMusicSinger.infer.YingMusicSinger import YingMusicSinger
52
+ _model = YingMusicSinger.from_pretrained("ASLP-lab/YingMusic-Singer")
53
+ _model = _model.to(get_device())
54
+ _model.eval()
55
+ return _model
56
+
57
+
58
+ def get_separator():
59
+ global _separator
60
+ if _separator is None:
61
+ download_files(task="infer")
62
+ from src.third_party.MusicSourceSeparationTraining.inference_api import Separator
63
+ _separator = Separator(
64
+ config_path="ckpts/config_vocals_mel_band_roformer_kj.yaml",
65
+ checkpoint_path="ckpts/MelBandRoformer.ckpt",
66
+ )
67
+ return _separator
68
+
69
+
70
+ # ---------------------------------------------------------------------------
71
+ # Vocal separation
72
+ # ---------------------------------------------------------------------------
73
+ def separate_vocals(audio_path: str) -> tuple:
74
+ """
75
+ Separate vocals and accompaniment, returns (vocals_path, accompaniment_path).
76
+ """
77
+ separator = get_separator()
78
+ wav, sr = torchaudio.load(audio_path)
79
+ vocal_wav, inst_wav, out_sr = separator.separate(wav, sr)
80
+
81
+ tmp_dir = tempfile.mkdtemp()
82
+ vocals_path = os.path.join(tmp_dir, "vocals.wav")
83
+ accomp_path = os.path.join(tmp_dir, "accompaniment.wav")
84
+ torchaudio.save(vocals_path, torch.from_numpy(vocal_wav), out_sr)
85
+ torchaudio.save(accomp_path, torch.from_numpy(inst_wav), out_sr)
86
+ return vocals_path, accomp_path
87
+
88
+
89
+ # ---------------------------------------------------------------------------
90
+ # Mix vocals + accompaniment
91
+ # ---------------------------------------------------------------------------
92
+ def mix_vocal_and_accompaniment(vocal_path: str, accomp_path: str, vocal_gain: float = 1.0) -> str:
93
+ vocal_wav, vocal_sr = torchaudio.load(vocal_path)
94
+ accomp_wav, accomp_sr = torchaudio.load(accomp_path)
95
+
96
+ if accomp_sr != vocal_sr:
97
+ accomp_wav = torchaudio.functional.resample(accomp_wav, accomp_sr, vocal_sr)
98
+
99
+ if vocal_wav.shape[0] != accomp_wav.shape[0]:
100
+ if vocal_wav.shape[0] == 1:
101
+ vocal_wav = vocal_wav.expand(accomp_wav.shape[0], -1)
102
+ else:
103
+ accomp_wav = accomp_wav.expand(vocal_wav.shape[0], -1)
104
+
105
+ min_len = min(vocal_wav.shape[1], accomp_wav.shape[1])
106
+ mixed = vocal_wav[:, :min_len] * vocal_gain + accomp_wav[:, :min_len]
107
+
108
+ peak = mixed.abs().max()
109
+ if peak > 1.0:
110
+ mixed = mixed / peak
111
+
112
+ out_path = os.path.join(tempfile.mkdtemp(), "mixed_output.wav")
113
+ torchaudio.save(out_path, mixed, sample_rate=vocal_sr)
114
+ return out_path
115
+
116
+
+ # ---------------------------------------------------------------------------
+ # Main inference pipeline
+ # ---------------------------------------------------------------------------
+ def synthesize(args):
+     actual_seed = args.seed if args.seed >= 0 else random.randint(0, 2**31 - 1)
+     print(f"[INFO] Using seed: {actual_seed}")
+
+     actual_ref_path = args.ref_audio
+     actual_melody_path = args.melody_audio
+     melody_accomp_path = None
+
+     # Step 1: Vocal separation (optional)
+     if args.separate_vocals:
+         print("[INFO] Separating vocals from reference audio...")
+         actual_ref_path, _ = separate_vocals(args.ref_audio)
+
+         print("[INFO] Separating vocals from melody audio...")
+         actual_melody_path, melody_accomp_path = separate_vocals(args.melody_audio)
+
+     # Step 2: Model inference
+     print("[INFO] Loading model...")
+     model = get_model()
+
+     print("[INFO] Running synthesis...")
+     audio_tensor, sr = model(
+         ref_audio_path=actual_ref_path,
+         melody_audio_path=actual_melody_path,
+         ref_text=args.ref_text.strip(),
+         target_text=args.target_text.strip(),
+         lrc_align_mode="sentence_level",
+         sil_len_to_end=args.sil_len_to_end,
+         t_shift=args.t_shift,
+         nfe_step=args.nfe_step,
+         cfg_strength=args.cfg_strength,
+         seed=actual_seed,
+     )
+
+     vocal_out_path = os.path.join(tempfile.mkdtemp(), "vocal_output.wav")
+     torchaudio.save(vocal_out_path, audio_tensor.to("cpu"), sample_rate=sr)
+
+     # Step 3: Mix accompaniment (optional)
+     if args.separate_vocals and args.mix_accompaniment and melody_accomp_path is not None:
+         print("[INFO] Mixing vocals with accompaniment...")
+         final_path = mix_vocal_and_accompaniment(vocal_out_path, melody_accomp_path)
+     else:
+         final_path = vocal_out_path
+
+     # Write to the specified output path
+     out_wav, out_sr = torchaudio.load(final_path)
+     os.makedirs(os.path.dirname(os.path.abspath(args.output)), exist_ok=True)
+     torchaudio.save(args.output, out_wav, sample_rate=out_sr)
+     print(f"[INFO] Saved to: {args.output}")
+
+
+ # ---------------------------------------------------------------------------
+ # Argument parser
+ # ---------------------------------------------------------------------------
+ def parse_args():
+     parser = argparse.ArgumentParser(
+         description="YingMusic Singer - Single sample command line inference"
+     )
+
+     # Required
+     parser.add_argument("--ref_audio", required=True,
+                         help="Reference audio path")
+     parser.add_argument("--melody_audio", required=True,
+                         help="Melody audio path")
+     parser.add_argument("--ref_text", required=True,
+                         help="Reference lyrics, use | to separate phrases")
+     parser.add_argument("--target_text", required=True,
+                         help="Target lyrics, use | to separate phrases")
+
+     # Output
+     parser.add_argument("--output", default="output.wav",
+                         help="Output wav path (default: output.wav)")
+
+     # Optional flags
+     parser.add_argument("--separate_vocals", action="store_true",
+                         help="Separate vocals before synthesis")
+     parser.add_argument("--mix_accompaniment", action="store_true",
+                         help="Mix accompaniment into output (requires --separate_vocals)")
+
+     # Advanced params
+     parser.add_argument("--nfe_step", type=int, default=32,
+                         help="NFE steps (default: 32)")
+     parser.add_argument("--cfg_strength", type=float, default=3.0,
+                         help="CFG strength (default: 3.0)")
+     parser.add_argument("--t_shift", type=float, default=0.5,
+                         help="t-shift (default: 0.5)")
+     parser.add_argument("--sil_len_to_end", type=float, default=0.5,
+                         help="Silence padding in seconds (default: 0.5)")
+     parser.add_argument("--seed", type=int, default=-1,
+                         help="Random seed, -1 for random (default: -1)")
+
+     return parser.parse_args()
+
+
+ if __name__ == "__main__":
+     args = parse_args()
+     synthesize(args)
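+
+
+ # Example invocation (illustrative; assumes this script is saved as inference.py,
+ # which is not confirmed by this diff):
+ #   python inference.py \
+ #       --ref_audio ref.wav \
+ #       --melody_audio melody.wav \
+ #       --ref_text "ref line one|ref line two" \
+ #       --target_text "new line one|new line two" \
+ #       --output out/output.wav \
+ #       --separate_vocals --mix_accompaniment --seed 42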
inference_mp.py ADDED
@@ -0,0 +1,324 @@
+ """
+ YingMusicSinger batch inference script.
+ Supports multi-GPU multiprocessing with per-worker progress bars.
+ Input can be a JSONL file or the LyricEditBench dataset.
+
+ Usage:
+     # JSONL input, 4 GPUs
+     python inference_mp.py \
+         --input_type jsonl \
+         --input_path /path/to/input.jsonl \
+         --output_dir /path/to/output \
+         --ckpt_path /path/to/ckpts \
+         --num_gpus 4
+
+     # LyricEditBench input
+     python inference_mp.py \
+         --input_type lyric_edit_bench_melody_control \
+         --output_dir /path/to/output \
+         --ckpt_path /path/to/ckpts \
+         --num_gpus 4
+ """
+
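+ # Expected JSONL record shape (inferred from the fields consumed in worker()
+ # below; one JSON object per line):
+ # {"id": "sample_001",
+ #  "timbre_ref_path": "/path/to/timbre_ref.wav",
+ #  "timbre_ref_text": "reference lyrics",
+ #  "melody_ref_path": "/path/to/melody_ref.wav",
+ #  "gen_text": "target lyrics"}
+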
+ import argparse
+ import json
+ import os
+ import sys
+ import traceback
+ from pathlib import Path
+
+ import torch
+ import torch.multiprocessing as mp
+ import torchaudio
+ from datasets import Audio, Dataset
+ from huggingface_hub import hf_hub_download
+ from tqdm import tqdm
+
+
+ def load_jsonl(path: str) -> list[dict]:
+     items = []
+     with open(path, "r", encoding="utf-8") as f:
+         for line in f:
+             line = line.strip()
+             if line:
+                 items.append(json.loads(line))
+     return items
+
+
+ def build_dataset_from_local(gtsinger_root: str):
+     """
+     Build the LyricEditBench dataset using your local GTSinger directory.
+
+     Args:
+         gtsinger_root: Root directory of your local GTSinger dataset.
+     """
+     # Download the inherited metadata from HuggingFace
+     json_path = hf_hub_download(
+         repo_id="ASLP-lab/LyricEditBench",
+         filename="GTSinger_Inherited.json",
+         repo_type="dataset",
+     )
+
+     with open(json_path, "r") as f:
+         data = json.load(f)
+
+     gtsinger_root = str(Path(gtsinger_root).resolve())
+
+     # Prepend the local root to the relative paths
+     for item in data:
+         item["melody_ref_path"] = os.path.join(gtsinger_root, item["melody_ref_path"])
+         item["timbre_ref_path"] = os.path.join(gtsinger_root, item["timbre_ref_path"])
+         # Point the audio fields at the resolved file paths
+         item["melody_ref_audio"] = item["melody_ref_path"]
+         item["timbre_ref_audio"] = item["timbre_ref_path"]
+
+     # Build a HuggingFace Dataset with Audio features
+     ds = Dataset.from_list(data)
+     ds = ds.cast_column("melody_ref_audio", Audio())
+     ds = ds.cast_column("timbre_ref_audio", Audio())
+
+     return ds
+
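+ # Note (editor): casting a column to datasets.Audio() means accessing e.g.
+ # row["melody_ref_audio"] decodes to {"array", "sampling_rate", "path"}.
+ # This script only consumes the *_path fields downstream, so decoding stays lazy.
+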
+ def load_subset(data: list, subset_id: str) -> list:
+     """Filter the dataset by a subset ID list."""
+     subset_path = hf_hub_download(
+         repo_id="ASLP-lab/LyricEditBench",
+         filename=f"id_lists/{subset_id}.txt",
+         repo_type="dataset",
+     )
+
+     with open(subset_path, "r") as f:
+         id_set = set(line.strip() for line in f if line.strip())
+
+     return [item for item in data if item["id"] in id_set]
+
+
+ def load_lyric_edit_bench(input_type) -> list[dict]:
+     # If you have GTSinger downloaded locally, use this:
+     ds_full = build_dataset_from_local(
+         "/user-fs/chenzihao/zhengjunjie/datas/Music/openvocaldata/GTSinger"
+     )
+
+     # Otherwise, you can use this:
+     # from datasets import load_dataset
+     # ds_full = load_dataset("ASLP-lab/LyricEditBench", split="test")
+
+     # At this point ds_full is loaded either way
+     subset_1k = load_subset(ds_full, "1K")
+     print(f"Loaded {len(subset_1k)} items")
+
+     items = []
+     for row in subset_1k:
+         if input_type == "lyric_edit_bench_melody_control":
+             items.append(
+                 {
+                     "id": row.get("id", ""),
+                     "melody_ref_path": row.get("melody_ref_path", ""),
+                     "gen_text": row.get("gen_text", ""),
+                     "timbre_ref_path": row.get("timbre_ref_path", ""),
+                     "timbre_ref_text": row.get("timbre_ref_text", ""),
+                 }
+             )
+         elif input_type == "lyric_edit_bench_sing_edit":
+             items.append(
+                 {
+                     "id": row.get("id", ""),
+                     "melody_ref_path": row.get("melody_ref_path", ""),
+                     "gen_text": row.get("gen_text", ""),
+                     # Singing-edit mode reuses the melody reference as the timbre reference
+                     "timbre_ref_path": row.get("melody_ref_path", ""),
+                     "timbre_ref_text": row.get("melody_ref_text", ""),
+                 }
+             )
+         else:
+             raise ValueError(f"Unknown input_type: {input_type}")
+     return items
+
+
+ def worker(
+     rank: int,
+     world_size: int,
+     items: list[dict],
+     output_dir: str,
+     ckpt_path: str,
+     args: argparse.Namespace,
+ ):
+     """Worker process that runs on a single GPU."""
+     device = f"cuda:{rank}"
+     torch.cuda.set_device(rank)
+
+     # ---- Load the model ----
+     from src.YingMusicSinger.infer.YingMusicSinger import YingMusicSinger
+
+     model = YingMusicSinger.from_pretrained(ckpt_path)
+     model.to(device)
+     model.eval()
+
+     # ---- Shard the data: each worker handles its own slice ----
+     shard = items[rank::world_size]
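+     # (Round-robin split: e.g. with world_size=4, rank 1 handles items 1, 5, 9, ...)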
+
+     # ---- Show the progress bar only on rank 0 (unless --show_all_progress) ----
+     pbar = tqdm(
+         shard,
+         desc=f"[GPU {rank}]",
+         position=rank,
+         leave=True,
+         disable=(rank != 0 and not args.show_all_progress),
+     )
+
+     success, fail = 0, 0
+     for item in pbar:
+         item_id = item.get("id", f"unknown_{success + fail}")
+         out_path = os.path.join(output_dir, f"{item_id}.wav")
+
+         # Skip outputs that already exist
+         if os.path.exists(out_path) and not args.overwrite:
+             success += 1
+             pbar.set_postfix(ok=success, err=fail)
+             continue
+
+         try:
+             with torch.no_grad():
+                 audio, sr = model(
+                     ref_audio_path=item["timbre_ref_path"],
+                     melody_audio_path=item["melody_ref_path"],
+                     ref_text=item.get("timbre_ref_text", ""),
+                     target_text=item.get("gen_text", ""),
+                     lrc_align_mode=args.lrc_align_mode,
+                     sil_len_to_end=args.sil_len_to_end,
+                     t_shift=args.t_shift,
+                     nfe_step=args.nfe_step,
+                     cfg_strength=args.cfg_strength,
+                     seed=args.seed
+                     if args.seed != -1
+                     else torch.randint(0, 2**32, (1,)).item(),
+                 )
+
+             # Move the audio to CPU before saving, matching the single-sample script
+             torchaudio.save(out_path, audio.cpu(), sample_rate=sr)
+             success += 1
+
+         except Exception as e:
+             fail += 1
+             print(f"\n[GPU {rank}] ERROR on {item_id}: {e}", file=sys.stderr)
+             if args.verbose:
+                 traceback.print_exc()
+
+         pbar.set_postfix(ok=success, err=fail)
+
+     pbar.close()
+     print(f"[GPU {rank}] Done. success={success}, fail={fail}")
+
+
+ def main():
+     parser = argparse.ArgumentParser(description="YingMusicSinger batch inference")
+
+     # ---- Input ----
+     parser.add_argument(
+         "--input_type",
+         type=str,
+         required=True,
+         choices=[
+             "jsonl",
+             "lyric_edit_bench_melody_control",
+             "lyric_edit_bench_sing_edit",
+         ],
+         help="Input type: jsonl, lyric_edit_bench_melody_control, or lyric_edit_bench_sing_edit",
+     )
+     parser.add_argument(
+         "--input_path",
+         type=str,
+         default=None,
+         help="JSONL file path (required when input_type=jsonl)",
+     )
+
+     # ---- Output ----
+     parser.add_argument(
+         "--output_dir",
+         type=str,
+         required=True,
+         help="Output directory",
+     )
+
+     # ---- Model ----
+     parser.add_argument(
+         "--ckpt_path",
+         type=str,
+         required=False,
+         help="Model checkpoint path (a directory saved with save_pretrained)",
+         default=None,
+     )
+
+     # ---- Inference parameters ----
+     parser.add_argument(
+         "--num_gpus", type=int, default=None, help="Number of GPUs to use (default: all)"
+     )
+     parser.add_argument(
+         "--lrc_align_mode",
+         type=str,
+         default="sentence_level",
+         choices=["sentence_level"],
+     )
+     parser.add_argument("--sil_len_to_end", type=float, default=0.5)
+     parser.add_argument("--t_shift", type=float, default=0.5)
+     parser.add_argument("--nfe_step", type=int, default=32)
+     parser.add_argument("--cfg_strength", type=float, default=3.0)
+     parser.add_argument("--seed", type=int, default=-1)
+
+     # ---- Misc ----
+     parser.add_argument("--overwrite", action="store_true", help="Overwrite existing output files")
+     parser.add_argument(
+         "--show_all_progress", action="store_true", help="Show a progress bar on every GPU"
+     )
+     parser.add_argument("--verbose", action="store_true", help="Print full error tracebacks")
+
+     args = parser.parse_args()
+
+     # ---- Validation ----
+     if args.input_type == "jsonl":
+         assert args.input_path is not None, "--input_path is required when input_type=jsonl"
+         assert os.path.isfile(args.input_path), f"File not found: {args.input_path}"
+
+     # ---- Load data ----
+     print("Loading data...")
+     if args.input_type == "jsonl":
+         items = load_jsonl(args.input_path)
+     else:
+         items = load_lyric_edit_bench(args.input_type)
+     print(f"Loaded {len(items)} items")
+
+     # ---- Determine GPU count ----
+     available_gpus = torch.cuda.device_count()
+     num_gpus = args.num_gpus or available_gpus
+     num_gpus = min(num_gpus, available_gpus, len(items))
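+     # (E.g. requesting 8 GPUs for only 3 items launches just 3 workers.)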
+     assert num_gpus > 0, "No GPU available"
+     print(f"Using {num_gpus} GPU(s)")
+
+     # ---- Create the output directory ----
+     os.makedirs(args.output_dir, exist_ok=True)
+
+     # ---- Launch worker processes ----
+     if num_gpus == 1:
+         # Single GPU: run directly, no spawn needed
+         worker(0, 1, items, args.output_dir, args.ckpt_path, args)
+     else:
+         mp.set_start_method("spawn", force=True)
+         processes = []
+         for rank in range(num_gpus):
+             p = mp.Process(
+                 target=worker,
+                 args=(rank, num_gpus, items, args.output_dir, args.ckpt_path, args),
+             )
+             p.start()
+             processes.append(p)
+
+         for p in processes:
+             p.join()
+
+     print(f"\nInference complete! Output directory: {args.output_dir}")
+
+
+ if __name__ == "__main__":
+     main()
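+
+ # Note (editor): the "spawn" start method is used above because each child
+ # process must initialize its own CUDA context; processes started with "fork"
+ # cannot safely reuse the parent's CUDA state.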
inference_mp.sh ADDED
@@ -0,0 +1,41 @@
+ # JSONL input
+ # python inference_mp.py \
+ #     --input_type jsonl \
+ #     --input_path /user-fs/chenzihao/aslp_music/haochunbo/final/YingMusic-Singer/temp_out/input_jsonl.jsonl \
+ #     --output_dir /user-fs/chenzihao/aslp_music/haochunbo/final/YingMusic-Singer/temp_out/Jsonl \
+ #     --ckpt_path /user-fs/chenzihao/aslp_music/haochunbo/final/YingMusic-Singer-ckpts \
+ #     --num_gpus 8 \
+ #     --show_all_progress
+
+
+ # LyricEditBench input:
+ # python inference_mp.py \
+ #     --input_type lyric_edit_bench_melody_control \
+ #     --output_dir /user-fs/chenzihao/aslp_music/haochunbo/final/YingMusic-Singer/temp_out2/LyricEditBench_melody_control \
+ #     --ckpt_path /user-fs/chenzihao/aslp_music/haochunbo/final/YingMusic-Singer-ckpts \
+ #     --num_gpus 8 \
+ #     --overwrite
+
+
+ # python inference_mp.py \
+ #     --input_type lyric_edit_bench_sing_edit \
+ #     --output_dir /user-fs/chenzihao/aslp_music/haochunbo/final/YingMusic-Singer/temp_out2/LyricEditBench_sing_edit \
+ #     --ckpt_path /user-fs/chenzihao/aslp_music/haochunbo/final/YingMusic-Singer-ckpts \
+ #     --num_gpus 8 \
+ #     --overwrite
+
+
+ python inference_mp.py \
+     --input_type lyric_edit_bench_melody_control \
+     --output_dir /user-fs/chenzihao/aslp_music/haochunbo/final/YingMusic-Singer/temp_out4/LyricEditBench_melody_control \
+     --ckpt_path /user-fs/chenzihao/aslp_music/haochunbo/final/YingMusic-Singer-ckpts-ema \
+     --num_gpus 8 \
+     --overwrite
+
+
+ # python inference_mp.py \
+ #     --input_type lyric_edit_bench_sing_edit \
+ #     --output_dir /user-fs/chenzihao/aslp_music/haochunbo/final/YingMusic-Singer/temp_out3/LyricEditBench_sing_edit \
+ #     --ckpt_path /user-fs/chenzihao/aslp_music/haochunbo/final/YingMusic-Singer-ckpts-ema-fix-extra-sli \
+ #     --num_gpus 8 \
+ #     --overwrite