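Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context

This is the official repository of the Libriheavy dataset. Libriheavy is a labeled version of Librilight. For more details, please refer to our paper: "Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context". The preprint is available on arXiv (2309.08105).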

How to download the dataset

The audio files of Libriheavy are the same as those of Librilight. You can download them with the following command:

bash run.sh --stage -1 --stop-stage -1

The Libriheavy manifests are hosted on huggingface and modelscope (for users in mainland China). You can download them as follows.

From huggingface:

bash run.sh --stage 1 --stop-stage 1

Or from modelscope:

bash run.sh --stage 0 --stop-stage 0

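The manifests downloaded above have the format shown below. Both texts and pre_texts come in two versions: the first item is the transcript from the original book (with casing and punctuation), and the second item is the decoding result of an ASR model. The second item was used to align the audio with the transcript in the original book, and we decided to keep it.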

{
  "id": "small/100/sea_fairies_0812_librivox_64kb_mp3/01_baum_sea_fairies_64kb_0",
  "start": 243.919,
  "duration": 7.36,
  "channel": 0,
  "supervisions": [
    {
      "id": "small/100/sea_fairies_0812_librivox_64kb_mp3/01_baum_sea_fairies_64kb_0",
      "recording_id": "small/100/sea_fairies_0812_librivox_64kb_mp3/01_baum_sea_fairies_64kb",
      "start": 0,
      "duration": 7.36,
      "channel": 0,
      "language": "English",
      "speaker": "100",
      "custom": {
        "texts": [
          "The little girl was thoughtful for a moment. \"But why do folks dive in the water when the mermaids smile an' wink?\" she asked.",
          "THE LITTLE GIRL WAS THOUGHTFUL FOR A MOMENT BUT WHY DO FOLKS DIVE IN THE WATER WHEN THE MERMAIDS SMILE AND WINK SHE ASKED"
        ],
        "pre_texts": [
          "...us mortal folk,\" replied Cap'n Bill. \"But if anyone happens to see 'em, what then, Cap'n?\" \"Then,\" he answered, slowly wagging his head, \"the mermais give 'em a smile an' a wink, an' they dive into the water an' gets drownded.\" \"S'pose they knew how to swim, Cap'n Bill?\" \"That don't make any diff'rence, Trot. The mermaids live deep down, an' the poor mortals never come up again.",
          "...US MORTAL FOLK REPLIED CAP'N BILL BUT IF ANYONE HAPPENS TO SEE EM WHAT THEN CAP'N THEN HE ANSWERED SLOWLY WAGGING HIS HEAD THE MERMAIDS GIVE EM A SMILE AND A WINK AND THEY DIVES INTO THE WATER AND GETS DROWNDED S'POSE THEY KNOW HOW TO SWIM CAP'N BILL THAT DON'T MAKE ANY DIFFERENCE TROT THE MERMAIDS LIVE DEEP DOWN AND THE POOR MORTALS NEVER COME UP AGAIN"
        ],
        "begin_byte": 4993,
        "end_byte": 5120
      }
    }
  ],
  "recording": {
    "id": "small/100/sea_fairies_0812_librivox_64kb_mp3/01_baum_sea_fairies_64kb",
    "sources": [
      {
        "type": "file",
        "channels": [
          0
        ],
        "source": "download/librilight/small/100/sea_fairies_0812_librivox_64kb_mp3/01_baum_sea_fairies_64kb.flac"
      }
    ],
    "sampling_rate": 16000,
    "num_samples": 9567080,
    "duration": 597.942,
    "channel_ids": [
      0
    ]
  },
  "custom": {
    "text_path": "download/librilight_text/output_text_small_cleaned/Sea Fairies/text.txt"
  },
  "type": "MonoCut"
}
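As a rough illustration of the layout above, the sketch below reads a manifest with the Python standard library and separates the two versions of texts and pre_texts. It assumes the manifest is a gzipped JSON-lines file with one cut per line. The names iter_cuts and split_texts, and the tiny demo cut, are hypothetical and not part of this repository.

```python
import gzip
import json
import os
import tempfile
from typing import Dict, Iterator, Tuple


def iter_cuts(path: str) -> Iterator[dict]:
    """Yield one cut (a dict) per non-empty line of a gzipped JSON-lines file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def split_texts(cut: dict) -> Dict[str, Tuple[str, str]]:
    """Return the (book, ASR) pairs stored under supervisions[0].custom."""
    custom = cut["supervisions"][0]["custom"]
    book_text, asr_text = custom["texts"]
    book_pre, asr_pre = custom["pre_texts"]
    return {"texts": (book_text, asr_text), "pre_texts": (book_pre, asr_pre)}


# Demo on a trimmed-down, made-up cut with the same field layout as above.
demo_cut = {
    "id": "demo_0",
    "supervisions": [
        {
            "custom": {
                "texts": ["Hello, world.", "HELLO WORLD"],
                "pre_texts": ["Some context.", "SOME CONTEXT"],
            }
        }
    ],
}
demo_path = os.path.join(tempfile.mkdtemp(), "demo_cuts.jsonl.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    f.write(json.dumps(demo_cut) + "\n")

for cut in iter_cuts(demo_path):
    pairs = split_texts(cut)
    print(cut["id"], pairs["texts"][0], "|", pairs["texts"][1])
```

lhotse can also load such manifests directly (for example with CutSet.from_file), which is probably the more convenient route when the cuts feed a lhotse-based training recipe.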

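This is the full version of Libriheavy, which can be used for a variety of speech tasks. You can further extract manifests for pure ASR training with the following command: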

bash run.sh --stage 2 --stop-stage 2

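You now have the corpus in both k2 format (lhotse cuts) and kaldi format, each in a normalized version (upper case, no punctuation) and a fully formatted version (casing and punctuation):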

├── cases_and_punc
│   ├── kaldi
│   │   ├── large
│   │   │   ├── segments
│   │   │   ├── text
│   │   │   └── wav.scp
......
│   │   ├── test_clean
│   │   │   ├── segments
│   │   │   ├── text
│   │   │   └── wav.scp
│   └── lhotse
│       ├── libriheavy_cuts_dev.jsonl.gz
│       ├── libriheavy_cuts_large.jsonl.gz
│       ├── libriheavy_cuts_medium.jsonl.gz
│       ├── libriheavy_cuts_small.jsonl.gz
│       ├── libriheavy_cuts_test_clean.jsonl.gz
│       ├── libriheavy_cuts_test_clean_large.jsonl.gz
│       ├── libriheavy_cuts_test_other.jsonl.gz
│       └── libriheavy_cuts_test_other_large.jsonl.gz
└── upper_no_punc
    ├── kaldi
    │   ├── large
    │   │   ├── segments
    │   │   ├── text
    │   │   └── wav.scp
    ......
    │   ├── test_other
    │   │   ├── segments
    │   │   ├── text
    │   │   └── wav.scp
    └── lhotse
        ├── libriheavy_cuts_dev.jsonl.gz
        ├── libriheavy_cuts_large.jsonl.gz
        ├── libriheavy_cuts_medium.jsonl.gz
        ├── libriheavy_cuts_small.jsonl.gz
        ├── libriheavy_cuts_test_clean.jsonl.gz
        ├── libriheavy_cuts_test_clean_large.jsonl.gz
        ├── libriheavy_cuts_test_other.jsonl.gz
        └── libriheavy_cuts_test_other_large.jsonl.gz
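The contents of the kaldi files are not shown here. Assuming they follow the usual Kaldi data-directory conventions (text: utterance id then transcript; segments: utterance id, recording id, start, end; wav.scp: recording id then path or command), a minimal parsing sketch could look like this. The function names and the sample lines are made up for illustration.

```python
from typing import Dict, Iterable, NamedTuple


class Segment(NamedTuple):
    recording_id: str
    start: float  # seconds from the start of the recording
    end: float


def _split_key(line: str):
    """Split a line into its first field and the (stripped) remainder."""
    parts = line.split(maxsplit=1)
    return parts[0], (parts[1].strip() if len(parts) > 1 else "")


def parse_text(lines: Iterable[str]) -> Dict[str, str]:
    """Map utterance id -> transcript."""
    return dict(_split_key(line) for line in lines if line.strip())


def parse_wav_scp(lines: Iterable[str]) -> Dict[str, str]:
    """Map recording id -> audio path (or command pipeline)."""
    return dict(_split_key(line) for line in lines if line.strip())


def parse_segments(lines: Iterable[str]) -> Dict[str, Segment]:
    """Map utterance id -> (recording id, start, end)."""
    out = {}
    for line in lines:
        if not line.strip():
            continue
        utt_id, rec_id, start, end = line.split()
        out[utt_id] = Segment(rec_id, float(start), float(end))
    return out


# Made-up sample lines in the assumed format.
text = parse_text(["utt_0 Hello, world.\n"])
segments = parse_segments(["utt_0 rec_a 243.919 251.279\n"])
wav_scp = parse_wav_scp(["rec_a download/librilight/rec_a.flac\n"])

for utt_id, seg in segments.items():
    print(wav_scp[seg.recording_id], seg.start, seg.end, text[utt_id])
```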

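For how to use pre_texts, see our paper "PromptASR for contextualized ASR with controllable style". The preprint is available on arXiv.

Note: the directory of the audio files in the manifests is hardcoded as download/librilight.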
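If the audio lives somewhere else, one option is to rewrite the source paths when loading the cuts. The sketch below works on plain dicts with the recording.sources layout shown in the manifest example; relocate_audio and the target directory /data/librilight are hypothetical.

```python
import copy

HARDCODED_PREFIX = "download/librilight"


def relocate_audio(cut: dict, new_root: str, old_root: str = HARDCODED_PREFIX) -> dict:
    """Return a copy of `cut` whose file sources point under `new_root`."""
    cut = copy.deepcopy(cut)  # leave the caller's dict untouched
    for source in cut["recording"]["sources"]:
        path = source["source"]
        if source.get("type") == "file" and path.startswith(old_root + "/"):
            source["source"] = new_root.rstrip("/") + path[len(old_root):]
    return cut


original = {
    "recording": {
        "sources": [
            {
                "type": "file",
                "channels": [0],
                "source": "download/librilight/small/100/sea_fairies_0812_librivox_64kb_mp3/01_baum_sea_fairies_64kb.flac",
            }
        ]
    }
}
moved = relocate_audio(original, "/data/librilight")
print(moved["recording"]["sources"][0]["source"])
```

Creating a symlink so that download/librilight resolves to the real audio directory avoids rewriting the manifests at all.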

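Leaderboard

Note: large subset = large + medium + small; medium subset = medium + small (that is, the large subset includes the large, medium, and small manifests above, and the medium subset includes the medium and small manifests).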
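One way to read this note in code: a training subset is the concatenation of several manifest files. The sketch below uses the libriheavy_cuts_*.jsonl.gz file names from the directory listing and assumes one JSON cut per line; SUBSET_PARTS, manifest_paths, and iter_manifest_lines are hypothetical names, and the demo files are fake.

```python
import gzip
import json
import os
import tempfile
from typing import Iterable, Iterator, List

# Which manifest files make up each training subset, per the note above.
SUBSET_PARTS = {
    "small": ["small"],
    "medium": ["small", "medium"],
    "large": ["small", "medium", "large"],
}


def manifest_paths(subset: str, manifest_dir: str) -> List[str]:
    """List the manifest files that together form `subset`."""
    return [
        os.path.join(manifest_dir, f"libriheavy_cuts_{part}.jsonl.gz")
        for part in SUBSET_PARTS[subset]
    ]


def iter_manifest_lines(paths: Iterable[str]) -> Iterator[str]:
    """Stream the non-empty lines of several gzipped manifests in order."""
    for path in paths:
        with gzip.open(path, "rt", encoding="utf-8") as f:
            for line in f:
                if line.strip():
                    yield line


# Demo with tiny fake manifests holding 1, 2 and 3 cuts.
demo_dir = tempfile.mkdtemp()
for part, n_cuts in [("small", 1), ("medium", 2), ("large", 3)]:
    path = os.path.join(demo_dir, f"libriheavy_cuts_{part}.jsonl.gz")
    with gzip.open(path, "wt", encoding="utf-8") as f:
        for i in range(n_cuts):
            f.write(json.dumps({"id": f"{part}_{i}"}) + "\n")

counts = {
    subset: sum(1 for _ in iter_manifest_lines(manifest_paths(subset, demo_dir)))
    for subset in SUBSET_PARTS
}
print(counts)
```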

Models trained on normalized text

Note: the models trained with Wenet may not be fully tuned.

Large subset

Contributor	Toolkit	LibriSpeech WER (clean & other)	Libriheavy WER (clean & other)	Recipe	Model
baseline	Wenet	2.02 & 5.22	2.74 & 6.68	CTC + Attention	model
baseline	icefall	1.62 & 3.36	2.20 & 5.57	Transducer	model

Medium subset

Contributor	Toolkit	LibriSpeech WER (clean & other)	Libriheavy WER (clean & other)	Recipe	Model
baseline	Wenet	3.15 & 7.88	3.80 & 8.80	CTC + Attention	model
baseline	icefall	2.35 & 4.82	2.90 & 6.57	Transducer	model

Small subset

Contributor	Toolkit	LibriSpeech WER (clean & other)	Libriheavy WER (clean & other)	Recipe	Model
baseline	Wenet	5.76 & 15.60	6.94 & 15.17	CTC + Attention	model
baseline	icefall	4.05 & 9.89	4.68 & 10.01	Transducer	model

Models trained on text with casing and punctuation

Large subset

Contributor	Toolkit	Libriheavy normalized WER (clean & other)	Libriheavy WER (clean & other)	Recipe	Model
baseline	icefall	2.28 & 5.68	7.76 & 11.32	Transducer	model

Medium subset

Contributor	Toolkit	Libriheavy normalized WER (clean & other)	Libriheavy WER (clean & other)	Recipe	Model
baseline	icefall	3.05 & 6.78	9.84 & 13.39	Transducer	model

Small subset

Contributor	Toolkit	Libriheavy normalized WER (clean & other)	Libriheavy WER (clean & other)	Recipe	Model
baseline	icefall	5.16 & 11.12	13.04 & 19.54	Transducer	model
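To illustrate why the two WER columns can differ so much for models trained on cased and punctuated text, here is a minimal sketch that scores the same hypothesis with and without normalization. The normalization rule (upper-case, keep only A-Z, digits, and apostrophes) is an assumption and not necessarily the one used to produce the numbers above; normalize, word_errors, and wer are hypothetical names.

```python
import re
from typing import List, Sequence


def normalize(text: str) -> str:
    """Upper-case and replace anything but A-Z, digits, apostrophes and spaces."""
    text = re.sub(r"[^A-Z0-9' ]+", " ", text.upper())
    return " ".join(text.split())


def word_errors(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def wer(refs: List[str], hyps: List[str], normalized: bool = False) -> float:
    """Corpus-level WER in percent: total word errors over total reference words."""
    errors = words = 0
    for ref, hyp in zip(refs, hyps):
        if normalized:
            ref, hyp = normalize(ref), normalize(hyp)
        ref_words, hyp_words = ref.split(), hyp.split()
        errors += word_errors(ref_words, hyp_words)
        words += len(ref_words)
    return 100.0 * errors / words


refs = ["But why do folks dive in the water?"]
hyps = ["but why do folks dive in the water"]
print(wer(refs, hyps), wer(refs, hyps, normalized=True))
```

In this toy case the casing of the first word and the missing question mark each count as a word error in the unnormalized score, while the normalized score sees no error at all.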

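Statistics

You can find a detailed description of the corpus in the Librilight paper. Below are some statistics of Libriheavy. The last seven columns give the distribution of durations (in seconds).

Subset	Hours	Books	Hours per speaker	Total speakers	Mean	Std	Min	25%	50%	75%	99%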
small	509	173	1.22	417	14.9	6.5	2.0	10	14.4	18.6	30.8
medium	5042	960	3.29	1531	14.8	6.4	2.0	9.9	14.3	18.5	30.8
large	50794	8592	7.54	6736	14.8	6.4	2.0	9.8	14.2	18.4	30.7
dev	22.3	180	0.16	141	15.0	6.5	2.1	10.1	14.5	18.6	30.8
test-clean	10.5	87	0.15	70	14.7	6.5	2.3	9.6	14.2	18.5	30.8
test-other	11.5	112	0.16	72	14.6	6.4	2.2	9.7	14.0	18.2	30.6
test-clean-large	107.5	95	1.49	72	14.8	6.4	2.0	9.9	14.3	18.4	30.8
test-other-large	100.3	136	1.37	73	14.6	6.5	2.0	9.7	14.0	18.4	30.8
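For readers who want to recompute such figures from a manifest, the sketch below derives a few of them from a list of cut dicts. It assumes hours are the summed cut durations and speakers are the distinct supervision speaker ids; whether the table was produced exactly this way (for example, population versus sample standard deviation) is not stated here. subset_stats and the demo cuts are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List


def subset_stats(cuts: List[dict]) -> Dict[str, float]:
    """Compute hours, speaker count and simple duration statistics."""
    durations = [cut["duration"] for cut in cuts]
    speakers = {sup["speaker"] for cut in cuts for sup in cut["supervisions"]}
    hours = sum(durations) / 3600.0
    return {
        "hours": hours,
        "speakers": len(speakers),
        "hours_per_speaker": hours / len(speakers),
        "mean": mean(durations),
        "std": pstdev(durations),  # population std; an assumption
        "min": min(durations),
    }


# Made-up cuts: three segments from two speakers.
demo_cuts = [
    {"duration": 7.36, "supervisions": [{"speaker": "100"}]},
    {"duration": 12.0, "supervisions": [{"speaker": "100"}]},
    {"duration": 16.64, "supervisions": [{"speaker": "200"}]},
]
stats = subset_stats(demo_cuts)
print(stats)

# Cross-check: hours / total speakers against the reported per-speaker hours.
table = {
    "small": (509, 417, 1.22),
    "medium": (5042, 1531, 3.29),
    "large": (50794, 6736, 7.54),
}
for name, (hours, speakers, reported) in table.items():
    print(name, round(hours / speakers, 2), reported)
```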

Creation pipeline

You can find the documentation of the creation pipeline here.

Citation

@misc{kang2023libriheavy,
      title={Libriheavy: a 50,000 hours ASR corpus with punctuation casing and context}, 
      author={Wei Kang and Xiaoyu Yang and Zengwei Yao and Fangjun Kuang and Yifan Yang and Liyong Guo and Long Lin and Daniel Povey},
      year={2023},
      eprint={2309.08105},
      archivePrefix={arXiv},
      primaryClass={eess.AS}
}