HuggingSound

HuggingSound: A toolkit for speech-related tasks based on Hugging Face's tools.

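I have no intention of building a very complex tool here. I just want an easy-to-use toolkit for my speech-related experiments. I hope this library will be helpful to others too :)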

Requirements

Python 3.8+

Installation

$ pip install huggingsound

How to use it?

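I'll try to summarize how to use this toolkit, but the documentation below leaves a lot out. I promise to improve it soon. For now, if you have questions, you can open an issue or read the source code to see how it works. You can find more usage examples in the repository's examples folder.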

Speech recognition

For speech recognition, you can use any CTC model hosted on the Hugging Face Hub. You can find some of the available models there.

Inference

from huggingsound import SpeechRecognitionModel

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-english")
audio_paths = ["/path/to/sagan.mp3", "/path/to/asimov.wav"]

transcriptions = model.transcribe(audio_paths)

print(transcriptions)

# Transcription format (a list of dicts, one per audio file):
# [
#  {
#   "transcription": "extraordinary claims require extraordinary evidence", 
#   "start_timestamps": [100, 120, 140, 180, ...],
#   "end_timestamps": [120, 140, 180, 200, ...],
#   "probabilities": [0.95, 0.88, 0.9, 0.97, ...]
# },
# ...]
#
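# As you can see, the result includes not only the transcription but also the timestamps (in milliseconds) and the probability of each character in the transcription.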
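The per-character fields make simple post-processing possible. The following is a minimal sketch, assuming each result dict has the four keys shown above and that the three lists line up one-to-one with the characters of `transcription`. The helper name `low_confidence_chars` and the threshold value are hypothetical and not part of HuggingSound.

```python
from typing import Dict, List, Tuple


def low_confidence_chars(result: Dict, threshold: float = 0.9) -> List[Tuple[str, int, int, float]]:
    """Return (char, start_ms, end_ms, probability) for characters below the threshold."""
    rows = zip(
        result["transcription"],
        result["start_timestamps"],
        result["end_timestamps"],
        result["probabilities"],
    )
    return [row for row in rows if row[3] < threshold]


# Hand-made result in the documented format (values are illustrative only).
example = {
    "transcription": "hi",
    "start_timestamps": [100, 120],
    "end_timestamps": [120, 140],
    "probabilities": [0.95, 0.88],
}

print(low_confidence_chars(example))
```

With real output you would call the helper on each element of `transcriptions`.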

Inference (boosted by a language model)

from huggingsound import SpeechRecognitionModel, KenshoLMDecoder

model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-english")
audio_paths = ["/path/to/sagan.mp3", "/path/to/asimov.wav"]

# The LM decoders expect the language model in KenLM format (an ARPA or binary file).
# You can download some example LM files from here: https://huggingface.co/jonatasgrosman/wav2vec2-large-xlsr-53-english/tree/main/language_model
lm_path = "path/to/your/lm_files/lm.binary"
unigrams_path = "path/to/your/lm_files/unigrams.txt"

# We implemented three LM-boosted decoders: KenshoLMDecoder, ParlanceLMDecoder, and FlashlightLMDecoder
# This example uses KenshoLMDecoder
# To use this decoder, you first need to install Kensho's pyctcdecode (https://github.com/kensho-technologies/pyctcdecode)
decoder = KenshoLMDecoder(model.token_set, lm_path=lm_path, unigrams_path=unigrams_path)

transcriptions = model.transcribe(audio_paths, decoder=decoder)

print(transcriptions)
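For intuition about what "boosted by a language model" means: LM-boosted CTC decoders typically rank candidate transcriptions by combining the acoustic model's score with a language-model score. The sketch below is a toy illustration of that idea only. It is not how KenshoLMDecoder is implemented, and the function `rescore`, the weights `alpha` and `beta`, and all scores are made up for the example.

```python
from typing import Callable, List, Tuple


def rescore(
    candidates: List[Tuple[str, float]],
    lm_log_prob: Callable[[str], float],
    alpha: float = 0.5,
    beta: float = 1.0,
) -> str:
    """Pick the candidate with the best combined score.

    candidates: (text, acoustic log-probability) pairs.
    alpha weights the LM score; beta adds a per-word bonus to offset
    the LM's bias toward shorter outputs.
    """
    def combined(item: Tuple[str, float]) -> float:
        text, acoustic = item
        return acoustic + alpha * lm_log_prob(text) + beta * len(text.split())

    return max(candidates, key=combined)[0]


# Toy LM: a lookup table of made-up log-probabilities (hypothetical values).
toy_lm = {"extraordinary claims": -2.0, "extra ordinary claims": -9.0}
candidates = [("extra ordinary claims", -1.0), ("extraordinary claims", -1.5)]

best = rescore(candidates, lambda text: toy_lm[text])
print(best)
```

In this toy case the acoustic score alone would prefer the first candidate, while the combined score prefers the second.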

Evaluation

from huggingsound import SpeechRecognitionModel
model = SpeechRecognitionModel("jonatasgrosman/wav2vec2-large-xlsr-53-english")

references = [
    {"path": "/path/to/sagan.mp3", "transcription": "extraordinary claims require extraordinary evidence"},
    {"path": "/path/to/asimov.wav", "transcription": "violence is the last refuge of the incompetent"},
]

evaluation = model.evaluate(references)

print(evaluation)

# Evaluation format: {"wer": 0.08, "cer": 0.02}
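To make the two numbers concrete: WER (word error rate) and CER (character error rate) are conventionally the edit distance between hypothesis and reference, divided by the reference length, counted in words or characters. The following is a plain-Python sketch of that standard definition, not HuggingSound's own implementation; `edit_distance` and `error_rate` are hypothetical helper names.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    previous = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        current = [i]
        for j, h in enumerate(hyp, start=1):
            current.append(min(
                previous[j] + 1,             # deletion
                current[j - 1] + 1,          # insertion
                previous[j - 1] + (r != h),  # substitution or match
            ))
        previous = current
    return previous[-1]


def error_rate(references: List[Sequence], hypotheses: List[Sequence]) -> float:
    """Total edit distance divided by total reference length."""
    errors = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    length = sum(len(r) for r in references)
    return errors / length


refs = ["extraordinary claims require extraordinary evidence"]
hyps = ["extraordinary claims require ordinary evidence"]

wer = error_rate([r.split() for r in refs], [h.split() for h in hyps])
cer = error_rate(refs, hyps)
print({"wer": wer, "cer": cer})
```

Here one of five reference words is wrong, so the WER is 0.2, while the CER is lower because only a few characters differ.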

Fine-tuning

from huggingsound import TrainingArguments, ModelArguments, SpeechRecognitionModel, TokenSet

model = SpeechRecognitionModel("facebook/wav2vec2-large-xlsr-53")
output_dir = "my/finetuned/model/output/dir"

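# First, you need to define your model's token set
# However, the token set is only needed for models that have not been fine-tuned yet
# If you pass a new token set for an already fine-tuned model, it will be ignored during training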
tokens = ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z", "'"]
token_set = TokenSet(tokens)

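# Define your train/eval data
train_data = [
    {"path": "/path/to/sagan.mp3", "transcription": "extraordinary claims require extraordinary evidence"},
    {"path": "/path/to/asimov.wav", "transcription": "violence is the last refuge of the incompetent"},
]
eval_data = [
    {"path": "/path/to/sagan2.mp3", "transcription": "the absence of evidence is not evidence of absence"},
    {"path": "/path/to/asimov2.wav", "transcription": "the true delight is in the finding out rather than in the knowing"},
]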

# Finally, fine-tune your model
model.finetune(
    output_dir, 
    train_data=train_data, 
    eval_data=eval_data,  # eval_data is optional
    token_set=token_set,
)
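If you would rather derive the token list from your training transcriptions than type it by hand, something like the following could work. It is a sketch under assumptions: `collect_tokens` is a hypothetical helper, not part of HuggingSound, and it assumes the space between words should not be listed as a token (the hand-written list above does not include one).

```python
from typing import Dict, List


def collect_tokens(data: List[Dict[str, str]]) -> List[str]:
    """Return the sorted unique characters found in the transcriptions, ignoring whitespace."""
    chars = set()
    for item in data:
        chars.update(ch for ch in item["transcription"] if not ch.isspace())
    return sorted(chars)


# Illustrative data in the same shape as the training data above.
train_data = [
    {"path": "/path/to/sagan.mp3", "transcription": "extraordinary claims require extraordinary evidence"},
    {"path": "/path/to/asimov.wav", "transcription": "violence is the last refuge of the incompetent"},
]

tokens = collect_tokens(train_data)
print(tokens)
```

The resulting list can then be passed to `TokenSet(tokens)` in place of the hand-written one.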

Troubleshooting

If you have trouble loading MP3 files, install ffmpeg (on Debian/Ubuntu): $ sudo apt-get install ffmpeg
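Before transcribing MP3 files, you can check from Python whether an ffmpeg executable is reachable. This is a small sketch using only the standard library; `ffmpeg_available` is a hypothetical helper name, and finding the executable on the PATH does not by itself guarantee that MP3 decoding will work.

```python
import shutil


def ffmpeg_available() -> bool:
    """Return True if an `ffmpeg` executable is found on the PATH."""
    return shutil.which("ffmpeg") is not None


if not ffmpeg_available():
    print("ffmpeg not found on PATH; MP3 loading may fail until it is installed.")
```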

Want to help?

If you want to contribute to the HuggingSound project, see the contribution guidelines.

You don't even need to know how to code to contribute to the project. Improving our documentation is an outstanding contribution too.

If this project has been useful to you, please share it with your friends. It may be helpful to them too.

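If you like this project and want to motivate the maintainers, give us a :star:. That kind of recognition makes us very happy with the work we've done with :heart:.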

You can also sponsor me :heart_eyes:

Citation

If you want to cite this tool, you can use the following entry:

@misc{grosman2022huggingsound,
  title={{HuggingSound: A toolkit for speech-related tasks based on Hugging Face's tools}},
  author={Grosman, Jonatas},
  howpublished={\url{https://github.com/jonatasgrosman/huggingsound}},
  year={2022}
}