Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis

Audio samples | Paper [abs] [pdf]

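Vocos is a fast neural vocoder designed to synthesize audio waveforms from acoustic features. Trained with a Generative Adversarial Network (GAN) objective, Vocos generates waveforms in a single forward pass. Unlike other typical GAN-based vocoders, Vocos does not model audio samples in the time domain. Instead, it generates spectral coefficients, which allows rapid audio reconstruction through the inverse Fourier transform.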
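As a rough illustration of that last step, the sketch below recombines per-frame magnitude and phase into complex spectral coefficients and converts them to a waveform with one inverse STFT call. It is not the Vocos network or its output head: the STFT of a test tone stands in for predicted coefficients, and the values n_fft = 1024 and hop = 256 are assumptions chosen only for the example.

```python
import math

import torch

# Assumed analysis parameters, for illustration only.
n_fft, hop = 1024, 256
window = torch.hann_window(n_fft)

# Stand-in for a model's output: the STFT of a 1-second, 440 Hz test tone at 24 kHz.
t = torch.arange(24000) / 24000.0
wave = torch.sin(2 * math.pi * 440.0 * t).unsqueeze(0)  # (B, T)
spec = torch.stft(wave, n_fft, hop_length=hop, window=window, return_complex=True)

# A Fourier-based vocoder would predict magnitude and phase for every frame...
magnitude, phase = spec.abs(), spec.angle()

# ...then recombine them into complex coefficients and invert in a single call.
spec_hat = torch.polar(magnitude, phase)
wave_hat = torch.istft(
    spec_hat, n_fft, hop_length=hop, window=window, length=wave.shape[-1]
)
```

All frames are inverted at once, so no sample-by-sample or upsampling-layer computation is needed in the time domain.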

Installation

To use Vocos only in inference mode, install it with:

pip install vocos

If you wish to train the model, install it with additional dependencies:

pip install "vocos[train]"

Usage

Reconstruct audio from mel-spectrogram

import torch

from vocos import Vocos

vocos = Vocos.from_pretrained("charactr/vocos-mel-24khz")

mel = torch.randn(1, 100, 256)  # B, C, T
audio = vocos.decode(mel)

Copy-synthesis from a file:

import torchaudio

y, sr = torchaudio.load(YOUR_AUDIO_FILE)
if y.size(0) > 1:  # mix to mono
    y = y.mean(dim=0, keepdim=True)
y = torchaudio.functional.resample(y, orig_freq=sr, new_freq=24000)
y_hat = vocos(y)

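Reconstruct audio from EnCodec tokens

Additionally, you need to provide a bandwidth_id, which is the index of the bandwidth embedding in the list [1.5, 3.0, 6.0, 12.0] (bandwidths in kbps).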

vocos = Vocos.from_pretrained("charactr/vocos-encodec-24khz")

audio_tokens = torch.randint(low=0, high=1024, size=(8, 200))  # 8 codebooks, 200 frames
features = vocos.codes_to_features(audio_tokens)
bandwidth_id = torch.tensor([2])  # 6 kbps

audio = vocos.decode(features, bandwidth_id=bandwidth_id)
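The bandwidth_id above is a position in the bandwidth list, not the bandwidth itself. A small lookup can make that mapping explicit. Note that bandwidth_to_id and BANDWIDTHS_KBPS are hypothetical names introduced for this sketch; they are not part of the vocos package.

```python
# Bandwidths in kbps, in the order given in this README.
BANDWIDTHS_KBPS = [1.5, 3.0, 6.0, 12.0]


def bandwidth_to_id(kbps: float) -> int:
    """Return the index of `kbps` in BANDWIDTHS_KBPS (hypothetical helper)."""
    try:
        return BANDWIDTHS_KBPS.index(float(kbps))
    except ValueError:
        raise ValueError(
            f"unsupported bandwidth {kbps}; expected one of {BANDWIDTHS_KBPS}"
        ) from None


print(bandwidth_to_id(6.0))  # 2, the same index used in the example above
```

The result can be wrapped as torch.tensor([bandwidth_to_id(6.0)]) before it is passed as bandwidth_id.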

Copy-synthesis from a file: it extracts and quantizes features with EnCodec, then reconstructs them with Vocos in a single forward pass.

y, sr = torchaudio.load(YOUR_AUDIO_FILE)
if y.size(0) > 1:  # mix to mono
    y = y.mean(dim=0, keepdim=True)
y = torchaudio.functional.resample(y, orig_freq=sr, new_freq=24000)

y_hat = vocos(y, bandwidth_id=bandwidth_id)

Integrate with 🐶 Bark text-to-audio model

See the example notebook.

Pre-trained models

Model Name	Dataset	Training Iterations	Parameters
charactr/vocos-mel-24khz	LibriTTS	1M	13.5M
charactr/vocos-encodec-24khz	DNS Challenge	2M	7.9M

Training

Prepare a filelist of audio files for the training and validation sets:

find "$TRAIN_DATASET_DIR" -name "*.wav" > filelist.train
find "$VAL_DATASET_DIR" -name "*.wav" > filelist.val
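Where find is not available, the same filelists can be produced from Python. This is a minimal sketch using only the standard library; write_filelist is a hypothetical helper, not something shipped with Vocos, and it assumes the expected format is one path per line, as the find commands produce.

```python
from pathlib import Path


def write_filelist(dataset_dir: str, out_path: str) -> int:
    """Write every .wav path under dataset_dir (searched recursively) to
    out_path, one per line, and return how many paths were written."""
    paths = sorted(str(p) for p in Path(dataset_dir).rglob("*.wav"))
    Path(out_path).write_text("".join(f"{p}\n" for p in paths))
    return len(paths)
```

For example, write_filelist(train_dir, "filelist.train") and write_filelist(val_dir, "filelist.val"), where train_dir and val_dir are your dataset directories. Unlike find, this version sorts the paths, so the output is stable across runs.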

Fill in a config file, e.g. vocos.yaml, with your filelist paths, then start training with:

python train.py -c configs/vocos.yaml

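Refer to the PyTorch Lightning documentation for details about customizing the training pipeline.

Citation

If this code contributes to your research, please cite our work: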

@article{siuzdak2023vocos,
  title={Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis},
  author={Siuzdak, Hubert},
  journal={arXiv preprint arXiv:2306.00814},
  year={2023}
}