Kaldi's speech-recognition output is closely tied to the characteristics of the language, and many fine-grained optimizations are involved. This thread is relevant to that, so it is translated here for study; thanks to the Kaldi open-source project team.
I don't think I have any recipe checked in. You would probably want
to train a system without word-position dependency (there is an option
to prepare_lang.sh to do this), and then for decoding, you would
prepare a lexicon where there is a one-to-one map between words and
phones, and you would build an LM on that. But there is a subtlety: you need to disallow repeats of silence, which would otherwise cause a lot of useless confusion in the lattices. This can be accomplished by composing the LM with a special small FST.
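For concreteness, here is a minimal sketch of what such a "no repeated silence" FST could look like, using OpenFst's text format and command-line tools. The directory arguments, file names (phones.txt, G.fst), the awk filter and the two-state design are illustrative assumptions, not taken from any checked-in recipe:

```bash
# The prepare_lang.sh option Dan mentions, for a system without
# word-position-dependent phones (directory arguments and the <UNK>
# entry are placeholders):
utils/prepare_lang.sh --position-dependent-phones false \
  data/local/dict "<UNK>" data/local/lang_tmp data/lang

# Build a 2-state acceptor over the phone set that forbids two SIL symbols
# in a row: state 0 = "last symbol was not SIL", state 1 = "last symbol was
# SIL"; there is no SIL arc out of state 1, and both states are final.
# Assumes phones.txt lists SIL plus the real phones, one "symbol id" pair
# per line; <eps> and disambiguation symbols (#0, #1, ...) are skipped.
awk '$1 == "SIL" { print 0, 1, $1, $1 }
     $1 != "SIL" && $1 != "<eps>" && $1 !~ /^#/ { print 0, 0, $1, $1
                                                  print 1, 0, $1, $1 }
     END { print 0; print 1 }' phones.txt |
  fstcompile --isymbols=phones.txt --osymbols=phones.txt |
  fstarcsort --sort_type=ilabel > no_sil_repeat.fst

# Composing the phone-level LM with this acceptor removes every path that
# would emit SIL twice in a row, before the decoding graph is built.
fstcompose G.fst no_sil_repeat.fst > G_nosilrep.fst
```

Doing the composition on the G side keeps the constraint entirely out of the decoder: repeated-SIL paths simply never appear in the graph.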
By the way, Dan, it's interesting you mention silence token repetitions. I did some experiments involving phoneme recognition some time ago and saw this phenomenon of multiple (2 in most cases) SILs in a row, but didn't investigate what caused it. I wasn't using lattices, but the "faster" decoder.
Is it because the SIL output labels are somehow pushed back during determinization, and, due to the loopy topology of the silence phone, the respective arc is traversed multiple times?
Also what is the special FST you mention?
Hi Vassil,
How did you represent silence in the lexicon and LM? Just by treating it as a word in the lexicon and training the language model on text containing silence tokens?
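For readers unsure what is being asked: "silence as a word" would look roughly like the sketch below. The word name !SIL and the file contents are made-up illustrations, not taken from any particular recipe:

```bash
# Hypothetical lexicon with an explicit "silence word" mapped to the SIL
# phone, plus LM training text that contains explicit silence tokens.
cat > lexicon.txt <<'EOF'
!SIL  SIL
hello hh ah l ow
world w er l d
EOF

cat > lm_text.txt <<'EOF'
!SIL hello world !SIL
hello !SIL world
EOF
# An n-gram LM estimated on lm_text.txt would then assign probabilities to
# !SIL like any other word, which is the representation being asked about.
```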
When the lexicon doesn't have optional sil transitions and silence isn't treated as a word in the LM, one way to deal with silences is to use the 'stochastic silence' model from the paper below. But this will lead to multiple silence tokens in the output, so I think Dan is talking about something else.
Allauzen, C., Mohri, M., Riley, M., Roark, B., "A Generalized Construction of Integrated Speech Recognition Transducers," in Proc. ICASSP, pp. 761-764, 2004.
Hi Paul,
I was doing these experiments in the context of long audio alignment, where phone bigrams were used to model the discrepancies between the transcription and the audio (i.e. something like what's described in http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.154.5104). I tried several slightly different graph configurations, but as far as I remember the one that definitely gave me multiple SIL tokens in a row had silence modelled as optional SIL:SIL self-loops at each state between every two words in the transcript (no optional silence in the lexicon, and silence was not part of the phone-bigram garbage model representing insertions and substitutions).
So I was wondering what could be the reason for the search to 'prefer' going into the SIL phone model once, then exiting, then going back into it, considering there was a penalty associated with the self-loop (IIRC corresponding to a silence insertion probability of 1/10 or 1/20).
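A toy version of the kind of alignment graph being described may make the self-loop penalty concrete. The symbols, file names and exact cost are illustrative assumptions (2.3 is roughly -log 0.1, i.e. a 1/10 silence insertion probability):

```bash
# Linear transcript "k ae t" with an optional SIL self-loop at every state;
# each traversal of a self-loop emits one SIL and pays -log(0.1) ~= 2.3, so
# on the graph side alone, two SILs in a row cost ~4.6 rather than ~2.3.
fstcompile --isymbols=phones.txt --osymbols=phones.txt <<'EOF' > toy_align.fst
0 0 SIL SIL 2.3
0 1 k k
1 1 SIL SIL 2.3
1 2 ae ae
2 2 SIL SIL 2.3
2 3 t t
3 3 SIL SIL 2.3
3
EOF
```

Given that extra cost, the puzzle is why the search would still prefer two consecutive SILs, which is what Dan addresses below.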
Thanks for the reference - I can't honestly say I understand everything in that paper, as it seems to require reading several others about various algorithms, but I think I understood the part about silence modelling.
Re the multiple silences-- sometimes it prefers to repeat silence so
that it can get back to the first state of silence, which is otherwise
not reachable from the other states. Generally it's best to disallow
this if possible, as it leads to useless confusion in the lattices.
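To make "the first state of silence is otherwise not reachable" concrete: the silence topology generated by standard Kaldi recipes looks roughly like the entry below (sketched from memory; the exact number of states and the 0.25/0.75 probabilities are assumptions and vary by recipe). State 0 has no incoming transitions from states 1-3, so once the decoder has left it, the only way to use pdf-class 0 again is to exit the phone and re-enter it, i.e. to emit SIL a second time.

```bash
# Rough sketch of a typical silence <TopologyEntry> (compare data/lang/topo
# in a standard recipe); only the structure matters here, not the numbers.
cat <<'EOF'
<TopologyEntry>
<ForPhones> 1 </ForPhones>
<State> 0 <PdfClass> 0 <Transition> 0 0.25 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 </State>
<State> 1 <PdfClass> 1 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 <Transition> 4 0.25 </State>
<State> 2 <PdfClass> 2 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 <Transition> 4 0.25 </State>
<State> 3 <PdfClass> 3 <Transition> 1 0.25 <Transition> 2 0.25 <Transition> 3 0.25 <Transition> 4 0.25 </State>
<State> 4 <PdfClass> 4 <Transition> 4 0.75 <Transition> 5 0.25 </State>
<State> 5 </State>
</TopologyEntry>
EOF
```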
Actually, Vassil-- I think there is another issue. The lattice-generation algorithm aims to have paths present for each output-symbol sequence whose likelihood is within a beam of the best. If silence appears as an output symbol in the graph, then paths with one vs. two silences have different symbol sequences, and if they are both within the beam then the lattice will contain both of them. In the lattice-generation algorithms from IBM we treated silence specially to avoid this, but the Kaldi approach is generally not to make silence an output symbol, so it will pick the best "silence path" for any word sequence. For phone language models, it may be more convenient to make silence an output symbol but disallow repeats. Either way should get around the problem.
Take a look at egs/timit/s5, which will probably suit your needs.
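For reference, assuming a standard Kaldi checkout, that recipe is run in the usual way (after pointing cmd.sh and path.sh at your own queue and KALDI_ROOT):

```bash
cd egs/timit/s5
# edit cmd.sh (run.pl vs. queue.pl) and path.sh (KALDI_ROOT) first, then:
./run.sh
```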
Xavier-- it's not clear what you're asking about. But anyway, whatever recipe it is, it will probably be sufficient for your purposes. The stuff with the silence is more of an optimization.