Gruut

Gruut is a tokenizer, text cleaner, and IPA (International Phonetic Alphabet) phonemizer for several human languages, with SSML support.

from gruut import sentences

text = 'He wound it around the wound, saying "I read it was $10 to read."'

for sent in sentences(text, lang="en-us"):
    for word in sent:
        if word.phonemes:
            print(word.text, *word.phonemes)

Output:

He h ˈi
wound w ˈaʊ n d
it ˈɪ t
around ɚ ˈaʊ n d
the ð ə
wound w ˈu n d
, |
saying s ˈeɪ ɪ ŋ
I ˈaɪ
read ɹ ˈɛ d
it ˈɪ t
was w ə z
ten t ˈɛ n
dollars d ˈɑ l ɚ z
to t ə
read ɹ ˈi d
. ‖

Note that "wound" and "read" are pronounced differently in different (grammatical) contexts.

A subset of SSML is also supported:

from gruut import sentences

ssml_text = """<?xml version="1.0" encoding="ISO-8859-1"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
    xml:lang="en-US">
<s>Today at 4pm, 2/1/2000.</s>
<s xml:lang="it">Un mese fà, 2/1/2000.</s>
</speak>"""

for sent in sentences(ssml_text, ssml=True):
    for word in sent:
        if word.phonemes:
            print(sent.idx, word.lang, word.text, *word.phonemes)

Output:

0 en-US Today t ə d ˈeɪ
0 en-US at ˈæ t
0 en-US four f ˈɔ ɹ
0 en-US P p ˈi
0 en-US M ˈɛ m
0 en-US , |
0 en-US February f ˈɛ b j u ˌɛ ɹ i
0 en-US first f ˈɚ s t
0 en-US , |
0 en-US two t ˈu
0 en-US thousand θ ˈaʊ z ə n d
0 en-US . ‖
1 it Un u n
1 it mese ˈm e s e
1 it fà f a
1 it , |
1 it due d j u
1 it gennaio d͡ʒ e n n ˈa j o
1 it duemila d u e ˈm i l a
1 it . ‖

See the documentation for more details.

Installation

pip install gruut

Languages other than English can be added during installation. For example, to add French and Italian support:

pip install -f 'https://synesthesiam.github.io/prebuilt-apps/' gruut[fr,it]

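The extra pip repository is needed for an updated num2words fork that includes support for more languages.

You can also download language files manually and place them in $XDG_CONFIG_HOME/gruut/ (which defaults to $HOME/.config/gruut).

If the corresponding Python package is not installed, gruut looks for language files in $XDG_CONFIG_HOME/gruut/<lang>/. Note that <lang> here is the full language name, e.g. de-de rather than just de.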
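The directory lookup described above can be sketched in a few lines of Python. This is only an illustration of the path rule stated in this section: the function name gruut_lang_dir is hypothetical and is not part of gruut's API, and gruut's own resolution logic may differ in detail.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def gruut_lang_dir(lang: str, env: Optional[Mapping[str, str]] = None) -> Path:
    """Return the directory where manually downloaded files for `lang` would live.

    Hypothetical helper: it mirrors the rule in the text
    ($XDG_CONFIG_HOME/gruut/<lang>, defaulting to $HOME/.config/gruut/<lang>).
    """
    env = os.environ if env is None else env
    config_home = env.get("XDG_CONFIG_HOME")
    if config_home:
        base = Path(config_home)
    else:
        # Fall back to $HOME/.config when XDG_CONFIG_HOME is unset or empty
        home = env.get("HOME") or str(Path.home())
        base = Path(home) / ".config"
    # <lang> is the full language name, e.g. "de-de" rather than "de"
    return base / "gruut" / lang


print(gruut_lang_dir("de-de", {"XDG_CONFIG_HOME": "/tmp/cfg"}))
print(gruut_lang_dir("de-de", {"HOME": "/home/user"}))
```

Passing the environment as a mapping keeps the sketch easy to try without changing real environment variables.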

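Supported Languages

gruut currently supports:

Arabic (ar)
Czech (cs or cs-cz)
German (de or de-de)
English (en or en-us)
Spanish (es or es-es)
Farsi/Persian (fa)
French (fr or fr-fr)
Italian (it or it-it)
Luxembourgish (lb)
Dutch (nl)
Russian (ru or ru-ru)
Swedish (sv or sv-se)
Swahili (sw)

The goal is to support all of voice2json's languages.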

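Dependencies

Python 3.7 or higher
Linux
- Tested on Debian Bullseye
num2words fork and Babel
- Currency/number handling
- The num2words fork includes additional language support (Arabic, Farsi, Swedish, Swahili)
gruut-ipa
- IPA pronunciation manipulation
pycrfsuite
- Part-of-speech tagging and grapheme-to-phoneme models
pydateparser
- Date parsing for multiple languages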

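Numbers, Dates, and More

gruut can automatically convert numbers, dates, and other expressions into words. Parsing and conversion are locale-aware, so "1/1/2020" may be interpreted as month/day/year or day/month/year depending on the language of the word or sentence (e.g., <s lang="...">).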

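gruut can automatically expand the following types of expressions into words:

Numbers - "123" to "one hundred and twenty three" (disable with verbalize_numbers=False or --no-numbers)
- Relies on Babel for parsing and num2words for verbalization
Dates - "1/1/2020" to "January first, twenty twenty" (disable with verbalize_dates=False or --no-dates)
- Relies on pydateparser for parsing and both Babel and num2words for verbalization
Currency - "$10" to "ten dollars" (disable with verbalize_currency=False or --no-currency)
- Relies on Babel for parsing and both Babel and num2words for verbalization
Times - "12:01am" to "twelve oh one A M" (disable with verbalize_times=False or --no-times)
- English only
- Relies on num2words for verbalization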
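To make the locale-aware date interpretation concrete, here is a minimal standard-library sketch. It is not how gruut parses dates (gruut relies on pydateparser and Babel); the MONTH_FIRST_LANGS table and the parse_numeric_date function are assumptions made up for this illustration.

```python
from datetime import date

# Assumed ordering table for illustration only: languages listed here read
# "a/b/yyyy" as month/day/year, everything else as day/month/year.
MONTH_FIRST_LANGS = {"en-us"}


def parse_numeric_date(text: str, lang: str) -> date:
    """Interpret a numeric "a/b/yyyy" date according to the language."""
    first, second, year = (int(part) for part in text.split("/"))
    if lang.lower() in MONTH_FIRST_LANGS:
        month, day = first, second
    else:
        day, month = first, second
    return date(year, month, day)


print(parse_numeric_date("2/1/2000", "en-US"))  # 2000-02-01
print(parse_numeric_date("2/1/2000", "it"))     # 2000-01-02
```

This matches the SSML example earlier in this README, where "2/1/2000" is read as "February first" in the en-US sentence and as "due gennaio" in the Italian one.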

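Command-Line Usage

The gruut module can be run with python3 -m gruut --language <LANGUAGE> <TEXT> or with the gruut command (from setup.py).

The gruut command is line-oriented: it consumes text and produces JSONL. You will probably want to install jq to manipulate gruut's JSONL output.

Plain Text

Takes raw text and outputs JSONL with cleaned words/tokens.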

echo 'This, right here, is some "RAW" text!' \
   | gruut --language en-us \
   | jq --raw-output '.words[].text'
This
,
right
here
,
is
some
"
RAW
"
text
!
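If jq is not available, the same extraction can be done with Python's json module. The sketch below assumes only what this section states: each output line is one JSON object with a "words" list whose items have a "text" field. The function name iter_word_texts and the abbreviated sample line are made up for the example.

```python
import io
import json
from typing import Iterable, Iterator, List


def iter_word_texts(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield the list of word texts for each JSONL sentence line."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        sentence = json.loads(line)
        yield [word["text"] for word in sentence.get("words", [])]


# Stand-in for gruut's stdout (abbreviated); a real pipeline would pass
# sys.stdin instead of this StringIO object.
sample = io.StringIO(
    '{"idx": 0, "text": "More text.", "words": '
    '[{"text": "More"}, {"text": "text"}, {"text": "."}]}\n'
)

for texts in iter_word_texts(sample):
    print(texts)  # ['More', 'text', '.']
```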

More information is available in the full JSON output:

gruut --language en-us 'More  text.' | jq .

Output:

{
  "idx": 0,
  "text": "More text.",
  "text_with_ws": "More text.",
  "text_spoken": "More text",
  "par_idx": 0,
  "lang": "en-us",
  "voice": "",
  "words": [
    {
      "idx": 0,
      "text": "More",
      "text_with_ws": "More ",
      "leading_ws": "",
      "training_ws": " ",
      "sent_idx": 0,
      "par_idx": 0,
      "lang": "en-us",
      "voice": "",
      "pos": "JJR",
      "phonemes": [
        "m",
        "ˈɔ",
        "ɹ"
      ],
      "is_major_break": false,
      "is_minor_break": false,
      "is_punctuation": false,
      "is_break": false,
      "is_spoken": true,
      "pause_before_ms": 0,
      "pause_after_ms": 0
    },
    {
      "idx": 1,
      "text": "text",
      "text_with_ws": "text",
      "leading_ws": "",
      "training_ws": "",
      "sent_idx": 0,
      "par_idx": 0,
      "lang": "en-us",
      "voice": "",
      "pos": "NN",
      "phonemes": [
        "t",
        "ˈɛ",
        "k",
        "s",
        "t"
      ],
      "is_major_break": false,
      "is_minor_break": false,
      "is_punctuation": false,
      "is_break": false,
      "is_spoken": true,
      "pause_before_ms": 0,
      "pause_after_ms": 0
    },
    {
      "idx": 2,
      "text": ".",
      "text_with_ws": ".",
      "leading_ws": "",
      "training_ws": "",
      "sent_idx": 0,
      "par_idx": 0,
      "lang": "en-us",
      "voice": "",
      "pos": null,
      "phonemes": [
        "‖"
      ],
      "is_major_break": true,
      "is_minor_break": false,
      "is_punctuation": false,
      "is_break": true,
      "is_spoken": false,
      "pause_before_ms": 0,
      "pause_after_ms": 0
    }
  ],
  "pause_before_ms": 0,
  "pause_after_ms": 0
}

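For the whole input line and for each word, the text property contains the processed input text (with normalized whitespace), while text_with_ws retains the original whitespace. The text_spoken property contains only words that are spoken, so punctuation and breaks are excluded.

Each word contains:

idx - zero-based index of the word in the sentence
sent_idx - zero-based index of the sentence in the input text
pos - part-of-speech tag (if available)
phonemes - list of IPA phonemes for the word (if available)
is_minor_break - true if the "word" separates phrases (comma, semicolon, etc.)
is_major_break - true if the "word" separates sentences (period, question mark, etc.)
is_break - true if the "word" is a major or minor break
is_punctuation - true if the "word" is surrounding punctuation (quotes, brackets, etc.)
is_spoken - true if the word is neither a break nor punctuation

See python3 -m gruut <LANGUAGE> --help for more options.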
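The relationship between the break, punctuation, and spoken flags can be checked against the sample output with a short sketch. It recomputes is_spoken from the other flags, as defined in the list above, and rebuilds text_spoken; the JSON is abbreviated from the example in this section, and the helper names are made up for the illustration.

```python
import json

# Abbreviated from the "More  text." example output shown in this section.
SENTENCE_JSON = """
{
  "text": "More text.",
  "words": [
    {"text": "More", "is_major_break": false, "is_minor_break": false, "is_punctuation": false},
    {"text": "text", "is_major_break": false, "is_minor_break": false, "is_punctuation": false},
    {"text": ".", "is_major_break": true, "is_minor_break": false, "is_punctuation": false}
  ]
}
"""


def is_spoken(word: dict) -> bool:
    """A word is spoken when it is neither a break nor punctuation."""
    is_break = word["is_major_break"] or word["is_minor_break"]
    return not (is_break or word["is_punctuation"])


def spoken_text(sentence: dict) -> str:
    """Join the spoken words with single spaces."""
    return " ".join(w["text"] for w in sentence["words"] if is_spoken(w))


sentence = json.loads(SENTENCE_JSON)
print(spoken_text(sentence))  # More text
```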

SSML

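A subset of SSML is supported:

<speak> - wraps the SSML text
- lang - sets the language of the document
<p> - paragraph
- lang - sets the language of the paragraph
<s> - sentence (disables automatic sentence breaking)
- lang - sets the language of the sentence
<w> / <token> - word (disables automatic tokenization)
- lang - sets the language of the word
- role - sets the word role (see Word Roles)
<lang lang="..."> - sets the language of the inner text
<voice name="..."> - sets the voice of the inner text
<say-as interpret-as=""> - forces an interpretation of the inner text
- interpret-as - one of "spell-out", "date", "number", "time", or "currency"
- format - how to format the text, depending on interpret-as
  - number - one of "cardinal", "ordinal", "digits", "year"
  - date - string containing "d" (cardinal day), "o" (ordinal day), "m" (month), or "y" (year)
<break time=""> - pauses for the given amount of time
- time - seconds ("123s") or milliseconds ("123ms")
<mark name=""> - user-defined mark (marks_before and marks_after attributes of words/sentences)
- name - name of the mark
<sub alias=""> - substitutes alias for the inner text
<phoneme ph="..."> - supplies phonemes for the inner text
- ph - phonemes for each word of the inner text, separated by whitespace
<lexicon id="..."> - inline or external pronunciation lexicon
- id - unique id of the lexicon (used in <lookup ref="...">)
- uri - if empty or missing, the lexicon is inline
- one or more <lexeme> child elements with:
  - optional role="..." (word roles separated by whitespace)
  - <grapheme>WORD</grapheme> - word text
  - <phoneme>P H O N E M E S</phoneme> - word pronunciation (phonemes separated by whitespace)
<lookup ref="..."> - uses a pronunciation lexicon for child elements
- ref - id from a <lexicon id="...">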
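As an illustration of the <break time=""> attribute format listed above, the following sketch converts a time value in seconds or milliseconds into the millisecond units used by the pause_before_ms and pause_after_ms fields. The function break_time_ms is hypothetical and is not gruut's implementation; accepting decimal values such as "1.5s" is an assumption.

```python
import re

# Digits (optionally with a decimal part) followed by "ms" or "s".
_BREAK_TIME = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*(ms|s)\s*$")


def break_time_ms(value: str) -> int:
    """Convert a break time such as "123s" or "123ms" to whole milliseconds."""
    match = _BREAK_TIME.match(value)
    if match is None:
        raise ValueError(f"Unsupported break time: {value!r}")
    amount, unit = match.groups()
    factor = 1000 if unit == "s" else 1
    return int(round(float(amount) * factor))


print(break_time_ms("123s"))   # 123000
print(break_time_ms("123ms"))  # 123
```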

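Word Roles

During phonemization, word roles are used to disambiguate pronunciations. Unless specified manually, a word's role is derived from its part-of-speech tag, in the form gruut:<TAG>. For initialisms and spell-out, the gruut:letter role indicates that, for example, "a" should be pronounced /eɪ/ rather than /ə/.

For en-us, the part-of-speech tagger also provides the following roles:

gruut:CD - number
gruut:DT - determiner
gruut:IN - preposition or subordinating conjunction
gruut:JJ - adjective
gruut:NN - noun
gruut:PRP - personal pronoun
gruut:RB - adverb
gruut:VB - verb
gruut:VBD - verb (past tense)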
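Role-based disambiguation can be pictured as a lexicon keyed first by word and then by role. The sketch below is a toy model, not gruut's internal data structure: the phonemes are taken from the "wound" example output earlier in this README, while the pairing with the VBD and NN tags, the fallback rule, and the function names are assumptions.

```python
from typing import Dict, Optional, Tuple

Phonemes = Tuple[str, ...]

# Toy lexicon: word -> role -> pronunciation. Pairing these pronunciations
# with gruut:VBD and gruut:NN is an assumption for illustration.
LEXICON: Dict[str, Dict[str, Phonemes]] = {
    "wound": {
        "gruut:VBD": ("w", "ˈaʊ", "n", "d"),
        "gruut:NN": ("w", "ˈu", "n", "d"),
    },
}


def role_from_pos(pos_tag: str) -> str:
    """Derive a role of the form gruut:<TAG> from a part-of-speech tag."""
    return f"gruut:{pos_tag}"


def lookup(
    word: str, pos_tag: Optional[str] = None, role: Optional[str] = None
) -> Optional[Phonemes]:
    """Pick a pronunciation; an explicit role wins over the tag-derived one."""
    by_role = LEXICON.get(word.lower())
    if not by_role:
        return None
    if role is None and pos_tag is not None:
        role = role_from_pos(pos_tag)
    if role in by_role:
        return by_role[role]
    # No matching role: fall back to the first listed pronunciation (assumed rule)
    return next(iter(by_role.values()))


print(*lookup("wound", pos_tag="VBD"))  # w ˈaʊ n d
print(*lookup("wound", pos_tag="NN"))   # w ˈu n d
```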

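Inline Lexicons

Inline pronunciation lexicons are supported through the <lexicon> and <lookup> tags. gruut diverges slightly from the SSML standard here by allowing a lexicon to be defined inside the SSML document itself (uri is empty or missing). In addition, the id attribute of the <lexicon> element may be omitted to indicate a "default" inline lexicon that does not require a corresponding <lookup> tag.

For example, the following document yields three different pronunciations of the word "tomato":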

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">

  <lexicon xml:id="test" alphabet="ipa">
    <lexeme>
      <grapheme>
        tomato
      </grapheme>
      <phoneme>
        <!-- Individual phonemes are separated by whitespace -->
        t ə m ˈɑ t oʊ
      </phoneme>
    </lexeme>
    <lexeme>
      <grapheme role="fake-role">
        tomato
      </grapheme>
      <phoneme>
        <!-- Made-up pronunciation for a fake word role -->
        t ə m ˈi t oʊ
      </phoneme>
    </lexeme>
  </lexicon>

  <w>tomato</w>
  <lookup ref="test">
    <w>tomato</w>
    <w role="fake-role">tomato</w>
  </lookup>
</speak>

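The first "tomato" is looked up in the U.S. English lexicon (/t ə m ˈeɪ t oʊ/). Within the scope of the <lookup> tag, the second and third "tomato" words are looked up in the inline lexicon. The third "tomato" has a role attached, which selects the made-up pronunciation in this case.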
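To show how such an inline lexicon is structured, here is a minimal sketch that reads <lexicon> elements with Python's xml.etree.ElementTree and builds a table keyed by lexicon id, word, and role. It is not gruut's parser: the function read_lexicons and the table layout are assumptions, and the sample document is a condensed version of the one above.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

NS = {"ssml": "http://www.w3.org/2001/10/synthesis"}
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # how ElementTree names xml:id

SSML = """<?xml version="1.0"?>
<speak version="1.1" xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">
  <lexicon xml:id="test" alphabet="ipa">
    <lexeme>
      <grapheme>tomato</grapheme>
      <phoneme>t ə m ˈɑ t oʊ</phoneme>
    </lexeme>
    <lexeme>
      <grapheme role="fake-role">tomato</grapheme>
      <phoneme>t ə m ˈi t oʊ</phoneme>
    </lexeme>
  </lexicon>
</speak>"""

Lexicon = Dict[Tuple[str, str], List[str]]


def read_lexicons(ssml: str) -> Dict[str, Lexicon]:
    """Map lexicon id -> (word, role) -> phonemes; "" stands for no id or no role."""
    root = ET.fromstring(ssml)
    lexicons: Dict[str, Lexicon] = {}
    for lexicon in root.iterfind(".//ssml:lexicon", NS):
        entries = lexicons.setdefault(lexicon.get(XML_ID, ""), {})
        for lexeme in lexicon.iterfind("ssml:lexeme", NS):
            grapheme = lexeme.find("ssml:grapheme", NS)
            phoneme = lexeme.find("ssml:phoneme", NS)
            if grapheme is None or phoneme is None:
                continue
            word = "".join(grapheme.itertext()).strip()
            # Accept the role on either <lexeme> or <grapheme>
            role = lexeme.get("role") or grapheme.get("role", "")
            # Phonemes are separated by whitespace
            entries[(word, role)] = "".join(phoneme.itertext()).split()
    return lexicons


table = read_lexicons(SSML)
print(table["test"][("tomato", "")])
print(table["test"][("tomato", "fake-role")])
```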

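gruut even allows the id of <lexicon> to be omitted entirely. Without an id, the <lookup> tag is no longer needed, so you can override the pronunciation of any word in the document: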

<?xml version="1.0"?>
<speak version="1.1"
       xmlns="http://www.w3.org/2001/10/synthesis"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
       xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                 http://www.w3.org/TR/speech-synthesis11/synthesis.xsd"
       xml:lang="en-US">

  <!-- No id means this lexicon changes all words outside a lookup -->
  <lexicon>
    <lexeme>
      <grapheme>
        tomato
      </grapheme>
      <phoneme>
        t ə m ˈɑ t oʊ
      </phoneme>
    </lexeme>
  </lexicon>

  <w>tomato</w>
</speak>