audiolm-pytorch

AudioLM - Pytorch

Implementation of <a href="https://google-research.github.io/seanet/audiolm/examples/">AudioLM</a>, a language modeling approach to audio generation out of Google Research, in Pytorch.

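It also extends the work with conditioning via classifier-free guidance using T5. This allows for text-to-audio or TTS, which the original paper does not offer. Yes, this means <a href="https://valle-demo.github.io/">VALL-E</a> can be trained from this repository. It is essentially the same.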
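
As a rough illustration of the classifier-free guidance mentioned above, the usual formula blends the logits predicted with and without the condition. The sketch below works on plain Python lists; the function name and the `cond_scale` parameter are illustrative assumptions, not the repository's code.

```python
def classifier_free_guidance(cond_logits, null_logits, cond_scale=3.0):
    """Blend conditional and unconditional logits.

    cond_scale = 1.0 returns the conditional logits unchanged;
    larger values push the prediction further toward the condition.
    """
    return [
        null + (cond - null) * cond_scale
        for cond, null in zip(cond_logits, null_logits)
    ]

cond_logits = [2.0, 0.5, -1.0]  # logits predicted with the text condition
null_logits = [1.0, 1.0, 0.0]   # logits predicted with the condition dropped

guided = classifier_free_guidance(cond_logits, null_logits, cond_scale=3.0)
print(guided)
```

Training for this kind of guidance typically drops the condition at random so that one model learns both predictions.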

Please join <a href="https://discord.gg/xBPBXfcFHd"><img alt="Join us on Discord" src="https://yellow-cdn.veclightyear.com/35dd4d3f/e638488d-85b2-401e-9f7c-4300f0371d5a.png"></a> if you are interested in replicating this work in the open.

This repository now also contains an MIT-licensed version of <a href="https://arxiv.org/abs/2107.03312">SoundStream</a>. It is also compatible with <a href="https://github.com/facebookresearch/encodec">EnCodec</a>, which is also MIT-licensed at the time of writing.

Update: AudioLM was essentially used to solve music generation in the new <a href="https://github.com/lucidrains/musiclm-pytorch">MusicLM</a>.

In the future, <a href="https://www.youtube.com/watch?v=olNvmUCmY8o">this movie clip</a> will no longer make any sense. You would just prompt an AI instead.

Appreciation

<a href="https://stability.ai/">Stability.ai</a> for the generous sponsorship to work on and open source cutting edge artificial intelligence research
<a href="https://huggingface.co/">🤗 Huggingface</a> for their amazing accelerate and transformers libraries
<a href="https://ai.facebook.com/">MetaAI</a> for <a href="https://github.com/facebookresearch/fairseq">Fairseq</a> and the liberal license
<a href="https://github.com/eonglints">@eonglints</a> and <a href="https://github.com/turian">Joseph</a> for offering their professional advice and expertise as well as pull requests!
<a href="https://github.com/djqualia">@djqualia</a>, <a href="https://github.com/yigityu">@yigityu</a>, <a href="https://github.com/inspirit">@inspirit</a>, and <a href="https://github.com/BlackFox1197">@BlackFox1197</a> for helping with the debugging of soundstream
<a href="https://github.com/zhvng">Allen</a> and <a href="https://github.com/LWprogramming">LWprogramming</a> for reviewing the code and submitting bug fixes!
<a href="https://github.com/ilya16">Ilya</a> for finding an issue with multi-scale discriminator downsampling and for improving soundstream training
<a href="https://github.com/AndreyBocharnikov">Andrey</a> for identifying a missing loss in soundstream and guiding me through the proper mel spectrogram hyperparameters
<a href="https://github.com/alexdemartos">Alejandro</a> and <a href="https://github.com/ilya16">Ilya</a> for sharing their results with training soundstream, and for working through a few issues with the local attention positional embeddings
<a href="https://github.com/LWprogramming">LWprogramming</a> for adding Encodec compatibility!
<a href="https://github.com/LWprogramming">LWprogramming</a> for finding an issue with the handling of the EOS token when sampling from the FineTransformer
<a href="https://github.com/YoungloLee">@YoungloLee</a> for identifying a big bug in the 1d causal convolution for soundstream, related to padding not accounting for strides
<a href="https://github.com/haydenshively">Hayden</a> for pointing out some discrepancies in the multi-scale discriminator for Soundstream

Install

$ pip install audiolm-pytorch

Usage

SoundStream & Encodec

There are two choices for the neural codec. If you want to use the pretrained 24kHz Encodec, just create an Encodec object as follows:

from audiolm_pytorch import EncodecWrapper
encodec = EncodecWrapper()
# Now you can use the encodec variable in the same way you'd use the soundstream variables below.

Otherwise, to stay closer to the original paper, you can use SoundStream. First, SoundStream needs to be trained on a large corpus of audio data:

import torch
from audiolm_pytorch import SoundStream, SoundStreamTrainer

soundstream = SoundStream(
    codebook_size = 4096,
    rq_num_quantizers = 8,
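    rq_groups = 2,                       # this paper proposes using multi-headed residual vector quantization - https://arxiv.org/abs/2305.02765
    use_lookup_free_quantizer = True,    # whether to use residual lookup free quantization - there are now reports of successful usage of this unpublished technique
    use_finite_scalar_quantizer = False, # whether to use residual finite scalar quantization
    attn_window_size = 128,              # local attention receptive field at bottleneck
    attn_depth = 2                       # 2 local attention transformer blocks - the soundstream authors were not attention experts, so I added some. encodec uses lstms, but attention should be better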
)

trainer = SoundStreamTrainer(
    soundstream,
    folder = '/path/to/audio/files',
    batch_size = 4,
    grad_accum_every = 8,         # effective batch size of 32
    data_max_length_seconds = 2,  # train on 2 second audio
    num_train_steps = 1_000_000
).cuda()

trainer.train()

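# after a lot of training, you can test the autoencoding as so

soundstream.eval() # your soundstream must be in eval mode, to avoid the residual dropout of the residual VQ that is necessary for training

audio = torch.randn(10080).cuda()
recons = soundstream(audio, return_recons_only = True) # (1, 10080) - 1 channel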
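
The trainer settings above combine `batch_size = 4` with `grad_accum_every = 8` for an effective batch size of 32. The following is a minimal sketch of why that works, using a hypothetical one-parameter model in plain Python rather than the trainer itself: averaging the gradients of 8 equally sized micro-batches gives the same gradient as one batch of 32 for a mean loss.

```python
def grad_mse(w, xs, ys):
    """Gradient of the mean squared error for the 1-parameter model y = w * x."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n

batch_size, grad_accum_every = 4, 8
effective_batch_size = batch_size * grad_accum_every

# toy data: 32 samples drawn from y = 3x
xs = [float(i) for i in range(1, effective_batch_size + 1)]
ys = [3.0 * x for x in xs]
w = 1.0

# accumulate the averaged gradient over 8 micro-batches of 4 samples each
accumulated = 0.0
for step in range(grad_accum_every):
    lo, hi = step * batch_size, (step + 1) * batch_size
    accumulated += grad_mse(w, xs[lo:hi], ys[lo:hi]) / grad_accum_every

# gradient computed on all 32 samples at once
full_batch = grad_mse(w, xs, ys)
print(accumulated, full_batch)
```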

Your trained SoundStream can then be used as a generic tokenizer for audio.


audio = torch.randn(1, 512 * 320)

codes = soundstream.tokenize(audio)

# you can now train anything with the codebook ids

recon_audio_from_codes = soundstream.decode_from_codebook_indices(codes)

# sanity check

assert torch.allclose(
    recon_audio_from_codes,
    soundstream(audio, return_recons_only = True)
)
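
To make the tokenize and decode round trip above more concrete, here is a toy residual quantizer in plain Python. It is a simplified sketch, not the SoundStream implementation: it quantizes a single scalar instead of feature vectors, and the codebooks and function names are made up for illustration. Each codebook encodes what the previous ones left over, and decoding sums the selected entries.

```python
def nearest(codebook, value):
    """Index of the codebook entry closest to value."""
    return min(range(len(codebook)), key=lambda i: abs(codebook[i] - value))

def rvq_encode(value, codebooks):
    """Quantize a scalar with a stack of codebooks, each coding the remaining residual."""
    ids, residual = [], value
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        ids.append(idx)
        residual -= codebook[idx]
    return ids

def rvq_decode(ids, codebooks):
    """Reconstruct by summing the selected entry of every codebook."""
    return sum(codebook[idx] for idx, codebook in zip(ids, codebooks))

codebooks = [
    [-1.0, 0.0, 1.0],        # first quantizer: coarse steps
    [-0.25, 0.0, 0.25],      # second quantizer: refines the residual
    [-0.0625, 0.0, 0.0625],  # third quantizer: refines it further
]

ids = rvq_encode(0.8, codebooks)
recon = rvq_decode(ids, codebooks)
print(ids, recon)
```

Adding more quantizers shrinks the reconstruction error, which is the trade-off a setting like `rq_num_quantizers` controls.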

You can also use soundstreams that are specific to AudioLM and MusicLM by importing AudioLMSoundStream and MusicLMSoundStream respectively.

from audiolm_pytorch import AudioLMSoundStream, MusicLMSoundStream

soundstream = AudioLMSoundStream(...) # say you want the hyperparameters from the AudioLM paper

# the rest is the same as above

As of version 0.17.0, you can call a class method on SoundStream to load from a checkpoint file without having to remember your configuration.

from audiolm_pytorch import SoundStream

soundstream = SoundStream.init_and_load_from('./path/to/checkpoint.pt')
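
One common way to load a model without remembering its configuration is to save the constructor arguments next to the weights. The sketch below shows that pattern on a hypothetical `TinyCodec` class with `pickle`; it is an assumption about the general idea, not a description of how SoundStream stores its checkpoints.

```python
import os
import pickle
import tempfile

class TinyCodec:
    """Hypothetical stand-in for a model; not the SoundStream class."""

    def __init__(self, codebook_size=1024, num_quantizers=8):
        self._config = dict(codebook_size=codebook_size, num_quantizers=num_quantizers)
        self.codebook_size = codebook_size
        self.num_quantizers = num_quantizers
        self.weights = [0.0] * num_quantizers

    def save(self, path):
        # store the constructor arguments together with the weights
        with open(path, 'wb') as f:
            pickle.dump({'config': self._config, 'weights': self.weights}, f)

    @classmethod
    def init_and_load_from(cls, path):
        with open(path, 'rb') as f:
            pkg = pickle.load(f)
        model = cls(**pkg['config'])   # rebuild with the saved configuration
        model.weights = pkg['weights']
        return model

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'checkpoint.pkl')
    original = TinyCodec(codebook_size=4096, num_quantizers=2)
    original.weights = [0.5, -0.5]
    original.save(path)
    restored = TinyCodec.init_and_load_from(path)

print(restored.codebook_size, restored.weights)
```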

To use <a href="https://wandb.ai">Weights & Biases</a> tracking, first set use_wandb_tracking = True on the SoundStreamTrainer, then do the following:


trainer = SoundStreamTrainer(
    soundstream,
    ...,
    use_wandb_tracking = True
)

# wrap .train() with the context manager, specifying the project and run name

with trainer.wandb_tracker(project = 'soundstream', run = 'baseline'):
    trainer.train()
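
For readers unfamiliar with the pattern, wrapping `.train()` in a context manager lets the tracker start a run before training and close it afterwards, even if training raises. A minimal sketch with a hypothetical `ToyTrainer` follows; it does not call Weights & Biases, and its method names are illustrative only.

```python
from contextlib import contextmanager

class ToyTrainer:
    """Hypothetical trainer; only the context manager pattern matters here."""

    def __init__(self):
        self.events = []
        self.active_run = None

    @contextmanager
    def tracker(self, project, run):
        self.active_run = (project, run)
        self.events.append(f'start {project}/{run}')
        try:
            yield self
        finally:
            # runs even if training raises, so the run is always closed
            self.events.append(f'finish {project}/{run}')
            self.active_run = None

    def train(self, steps=3):
        for step in range(steps):
            self.events.append(f'log step {step} to {self.active_run[1]}')

toy_trainer = ToyTrainer()
with toy_trainer.tracker(project='soundstream', run='baseline'):
    toy_trainer.train()

print(toy_trainer.events)
```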

Hierarchical Transformers

Then three separate transformers (SemanticTransformer, CoarseTransformer, FineTransformer) need to be trained.

ex. SemanticTransformer

import torch
from audiolm_pytorch import HubertWithKmeans, SemanticTransformer, SemanticTransformerTrainer

# hubert checkpoints can be downloaded at
# https://github.com/facebookresearch/fairseq/tree/main/examples/hubert

wav2vec = HubertWithKmeans(
    checkpoint_path = './hubert/hubert_base_ls960.pt',
    kmeans_path = './hubert/hubert_base_ls960_L9_km500.bin'
)

semantic_transformer = SemanticTransformer(
    num_semantic_tokens = wav2vec.codebook_size,
    dim = 1024,
    depth = 6,
    flash_attn = True
).cuda()


trainer = SemanticTransformerTrainer(
    transformer = semantic_transformer,
    wav2vec = wav2vec,
    folder ='/path/to/audio/files',
    batch_size = 1,
    data_max_length = 320 * 32,
    num_train_steps = 1
)

trainer.train()
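
The `HubertWithKmeans` wrapper above pairs a HuBERT checkpoint with a k-means file. As a rough sketch of the idea, and assuming the semantic tokens are cluster indices of feature frames, each frame is assigned to its nearest centroid. The example uses tiny made-up 2-d features and three centroids in plain Python; the function names are hypothetical.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def assign_tokens(frames, centroids):
    """Map each feature frame to the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(frame, centroids[k]))
        for frame in frames
    ]

# toy values; a real k-means model has many more, higher-dimensional centroids
centroids = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
frames = [[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7], [0.4, 0.3]]

semantic_tokens = assign_tokens(frames, centroids)
print(semantic_tokens)
```

The number of centroids plays the role of `wav2vec.codebook_size`, which sets `num_semantic_tokens` for the transformer.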

ex. CoarseTransformer

import torch
from audiolm_pytorch import HubertWithKmeans, SoundStream, CoarseTransformer, CoarseTransformerTrainer

wav2vec = HubertWithKmeans(
    checkpoint_path = './hubert/hubert_base_ls960.pt',
    kmeans_path = './hubert/hubert_base_ls960_L9_km500.bin'
)

soundstream = SoundStream.init_and_load_from('/path/to/trained/soundstream.pt')

coarse_transformer = CoarseTransformer(
    num_semantic_tokens = wav2vec.codebook_size,
    codebook_size = 1024,
    num_coarse_quantizers = 3,
    dim = 512,
    depth = 6,
    flash_attn = True
)

trainer = CoarseTransformerTrainer(
    transformer = coarse_transformer,
    codec = soundstream,
    wav2vec = wav2vec,
    folder = '/path/to/audio/files',
    batch_size = 1,
    data_max_length = 320 * 32,
    num_train_steps = 1_000_000
)

trainer.train()

ex. FineTransformer

import torch
from audiolm_pytorch import SoundStream, FineTransformer, FineTransformerTrainer

soundstream = SoundStream.init_and_load_from('/path/to/trained/soundstream.pt')

fine_transformer = FineTransformer(
    num_coarse_quantizers = 3,
    num_fine_quantizers = 5,
    codebook_size = 1024,
    dim = 512,
    depth = 6,
    flash_attn = True
)

trainer = FineTransformerTrainer(
    transformer = fine_transformer,
    codec = soundstream,
    folder = '/path/to/audio/files',
    batch_size = 1,
    data_max_length = 320 * 32,
    num_train_steps = 1_000_000
)

trainer.train()
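
The examples above use `num_coarse_quantizers = 3` and `num_fine_quantizers = 5`, so 8 quantizer levels in total. The sketch below shows one way the per-frame codes could be split between the two stages and flattened into token sequences. The frame-by-frame flattening order and the toy ids are assumptions for illustration, not taken from the repository.

```python
num_coarse_quantizers, num_fine_quantizers = 3, 5
num_quantizers = num_coarse_quantizers + num_fine_quantizers

# toy codes for 2 time frames: codes[t][q] is the id from quantizer q at frame t
codes = [
    [10, 11, 12, 13, 14, 15, 16, 17],
    [20, 21, 22, 23, 24, 25, 26, 27],
]

# the first quantizers go to the coarse stage, the remaining ones to the fine stage
coarse = [frame[:num_coarse_quantizers] for frame in codes]
fine = [frame[num_coarse_quantizers:] for frame in codes]

# flatten frame by frame so an autoregressive model sees a single token sequence
coarse_sequence = [token for frame in coarse for token in frame]
fine_sequence = [token for frame in fine for token in frame]

print(coarse_sequence)
print(fine_sequence)
```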

All together now

import torch
from audiolm_pytorch import AudioLM

audiolm = AudioLM(
    wav2vec = wav2vec,
    codec = soundstream,
    semantic_transformer = semantic_transformer,
    coarse_transformer = coarse_transformer,
    fine_transformer = fine_transformer
)

generated_wav = audiolm(batch_size = 1)

# or with priming

generated_wav_with_prime = audiolm(prime_wave = torch.randn(1, 320 * 8))

# or with text condition, if given

generated_wav_with_text_condition = audiolm(text = ['chirping of birds and the distant echoes of bells'])

Text Conditioned Audio Synthesis

Update: It looks like this will work, given <a href="https://valle-demo.github.io/">'VALL-E'</a>

ex. Semantic Transformer

import torch
from audiolm_pytorch import HubertWithKmeans, SemanticTransformer, SemanticTransformerTrainer

wav2vec = HubertWithKmeans(
    checkpoint_path = './hubert/hubert_base_ls960.pt',
    kmeans_path = './hubert/hubert_base_ls960_L9_km500.bin'
)

semantic_transformer = SemanticTransformer(
    num_semantic_tokens = 500,
    dim = 1024,
    depth = 6,
    has_condition = True,               # this will have to be set to True
    cond_as_self_attn_prefix = True     # whether to condition as a prefix to self attention, instead of cross attention, as was done in the 'VALL-E' paper
).cuda()

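# mock text audio dataset (as an example)

# you will have to extend your own dataset from `Dataset`, and return an audio tensor as well as a string (the audio description) in any order (the framework will autodetect and route it into the transformer)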

from torch.utils.data import Dataset

class MockTextAudioDataset(Dataset):
    def __init__(self, length = 100, audio_length = 320 * 32):
        super().__init__()
        self.audio_length = audio_length
        self.len = length

    def __len__(self):
        return self.len

    def __getitem__(self, idx):
        mock_audio = torch.randn(self.audio_length)
        mock_caption = 'audio caption'
        return mock_caption, mock_audio

dataset = MockTextAudioDataset()

# instantiate semantic transformer trainer and train

trainer = SemanticTransformerTrainer(
    transformer = semantic_transformer,
    wav2vec = wav2vec,
    dataset = dataset,
    batch_size = 4,
    grad_accum_every = 8,
    data_max_length = 320 * 32,
    num_train_steps = 1_000_000
)

trainer.train()

# after much training

sample = trainer.generate(text = ['sound of rain drops on the rooftops'], batch_size = 1, max_length = 2) # (1, < 128) - may terminate early if it detects [eos]
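
The mock dataset above returns `(caption, audio)`, and the comments note that the order does not matter. A minimal sketch of how such routing by type could work is shown below. It uses a plain list as a stand-in for the audio tensor, and the function and field names are hypothetical; the library's own detection logic may differ.

```python
def route_fields(item):
    """Sort the fields of a dataset item into named slots by type, ignoring position."""
    routed = {}
    for field in item:
        if isinstance(field, str):
            routed['text'] = field
        elif isinstance(field, list):  # stand-in for an audio tensor
            routed['raw_wave'] = field
        else:
            raise TypeError(f'unsupported field type: {type(field).__name__}')
    return routed

caption_first = route_fields(('audio caption', [0.1, -0.2, 0.3]))
audio_first = route_fields(([0.1, -0.2, 0.3], 'audio caption'))

print(caption_first == audio_first)
```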

Multi-GPU

Because all the trainer classes use <a href="https://huggingface.co/docs/accelerate/accelerator">🤗 Accelerator</a>, you can easily do multi-GPU training with the accelerate command, as follows.

At the project root

$ accelerate config

Then, in the same directory

$ accelerate launch train.py

Citations

@misc{https://doi.org/10.48550/arxiv.2107.03312,
  title  = {SoundStream: An End-to-End Neural Audio Codec},
  author = {Zeghidour, Neil and Luebs, Alejandro and Omran, Ahmed and Skoglund, Jan and Tagliasacchi, Marco},
  publisher = {arXiv},
  url    = {https://arxiv.org/abs/2107.03312},
  year   = {2021}
}

@misc{shazeer2020glu,
    title   = {GLU Variants Improve Transformer},
    author  = {Noam Shazeer},
    year    = {2020},
    url     = {https://arxiv.org/abs/2002.05202}
}

@article{Shazeer2019FastTD,
    title   = {Fast Transformer Decoding: One Write-Head is All You Need},
    author  = {Noam M. Shazeer},
    journal = {ArXiv},
    year    = {2019},
    volume  = {abs/1911.02150}
}

@article{Ho2022ClassifierFreeDG,
    title   = {Classifier-Free Diffusion Guidance},
    author  = {Jonathan Ho},
    journal = {ArXiv},
    year    = {2022},
    volume  = {abs/2207.12598}
}

@misc{crowson2022,
    author  = {Katherine Crowson},
    url     = {https://twitter.com/rivershavewings}
}

@misc{ding2021cogview,
    title   = {CogView: Mastering Text-to-Image Generation via Transformers},
    author  = {Ming Ding and Zhuoyi Yang and Wenyi Hong and Wendi Zheng and Chang Zhou and Da Yin and Junyang Lin and Xu Zou and Zhou Shao and Hongxia Yang and Jie Tang},
    year    = {2021},
    eprint  = {2105.13290},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

@article{Liu2022FCMFC,
    title   = {FCM: Forgetful Causal Masking Makes Causal Language Models Better Zero-Shot Learners},
    author  = {Hao Liu and Xinyang Geng and Lisa Lee and Igor Mordatch and Sergey Levine and Sharan Narang and P. Abbeel},
    journal = {ArXiv},
    year    = {2022},
    volume  = {abs/2210.13432}
}

@inproceedings{anonymous2022normformer,
    title   = {NormFormer: Improved Transformer Pretraining with Extra Normalization},
    author  = {Anonymous},
    booktitle = {Submitted to The Tenth International Conference on Learning Representations},
    year    = {2022},
    url     = {https://openreview.net/forum?id=GMYWzWztDx5},
    note    = {under review}
}

@misc{liu2021swin,
    title   = {Swin Transformer V2: Scaling Up Capacity and Resolution},
    author  = {Ze Liu and Han Hu and Yutong Lin and Zhuliang Yao and Zhenda Xie and Yixuan Wei and Jia Ning and Yue Cao and Zheng Zhang and Li Dong and Furu Wei and Baining Guo},
    year    = {2021},
    eprint  = {2111.09883},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

@article{Li2021LocalViTBL,
    title   = {LocalViT: Bringing Locality to Vision Transformers},
    author  = {Yawei Li and K. Zhang and Jie Cao and Radu Timofte and Luc Van Gool},
    journal = {ArXiv},
    year    = {2021},
    volume  = {abs/2104.05707}
}

@article{Defossez2022HighFN,
    title   = {High Fidelity Neural Audio Compression},
    author  = {Alexandre D{\'e}fossez and Jade Copet and Gabriel Synnaeve and Yossi Adi},
    journal = {ArXiv},
    year    = {2022},
    volume  = {abs/2210.13438}
}

@article{Hu2017SqueezeandExcitationN,
    title   = {Squeeze-and-Excitation Networks},
    author  = {Jie Hu and Li Shen and Gang Sun},
    journal = {2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition},
    year    = {2017},
    pages   = {7132-7141}
}

@inproceedings{Yang2023HiFiCodecGV,
    title   = {HiFi-Codec: Group-residual Vector quantization for High Fidelity Audio Codec},
    author  = {Dongchao Yang and Songxiang Liu and Rongjie Huang and Jinchuan Tian and Chao Weng and Yuexian Zou},
    year    = {2023}
}

@article{Kazemnejad2023TheIO,
    title   = {The Impact of Positional Encoding on Length Generalization in Transformers},
    author  = {Amirhossein Kazemnejad and Inkit Padhi and Karthikeyan Natesan Ramamurthy and Payel Das and Siva Reddy},
    journal = {ArXiv},
    year    = {2023},
    volume  = {abs/2305.19466}
}

@inproceedings{dao2022flashattention,
    title   = {Flash{A}ttention: Fast and Memory-Efficient Exact Attention with {IO}-Awareness},
    author  = {Dao, Tri and Fu, Daniel Y. and Ermon, Stefano and Rudra, Atri and R{\'e}, Christopher},
    booktitle = {Advances in Neural Information Processing Systems},
    year    = {2022}
}

@misc{yu2023language,
    title   = {Language Model Beats Diffusion -- Tokenizer is Key to Visual Generation},
    author  = {Lijun Yu and José Lezama and Nitesh B. Gundavarapu and Luca Versari and Kihyuk Sohn and David Minnen and Yong Cheng and Agrim Gupta and Xiuye Gu and Alexander G. Hauptmann and Boqing Gong and Ming-Hsuan Yang and Irfan Essa and David A. Ross and Lu Jiang},
    year    = {2023},
    eprint  = {2310.05737},
    archivePrefix = {arXiv},
    primaryClass = {cs.CV}
}

@inproceedings{Katsch2023GateLoopFD,
    title   = {GateLoop: Fully Data-Controlled Linear Recurrence for Sequence Modeling},
    author  = {Tobias Katsch},
    year    = {2023},
    url     = {https://api.semanticscholar.org/CorpusID:265018962}
}