I didn't manage to find the answer to this elsewhere. In one sentence: is it possible to specify optional noise phonemes/words in Kaldi, like it is possible with silence?
Longer description:
If I have a lexicon like:
!SIL SIL
<INT> NSN
<SPK> SPN
...
where NSN and SPN are declared as silence phones.
Would the right way to achieve this be to put
SIL NSN SPN
into the optional_silence file?
To clarify even more: in the training set <INT> and <SPK> events are annotated, e.g.:
... <SPK> NINE THREE <INT> ...
and I would like the training algorithm to recognize those "words", and to align appropriately. If possible, I would NOT like insertion of optional noise during training, because I know exactly where it is (and isn't).
However, in the test set I also have these events annotated, but would like them to be ignored. I would like the decoder to use optional NSN and SPN, but to ignore the recognized noise and the annotated events (not to count them as errors). Of course, I can erase these events from the test set, if that would solve this part of the problem.
And, related to this, if I manage to pronounce these events as optional, how would it affect discriminative training, and what is the recommended strategy for that part of the training?
The scripts currently only support having a single optional-silence phone.
The main consumer of the optional-silence phone is the script
prepare_lang.pl, which gives it to the script that creates the lexicon.
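Since only a single optional silence is supported, the usual setup is to list all three non-speech phones as silence phones but mark only SIL as optional. A minimal sketch of that dict directory, assuming the standard Kaldi data/local/dict layout (file names are the conventional ones; the lexicon entries are taken from this thread):

```shell
# Standard Kaldi dict directory: all non-speech phones are silence phones,
# but only SIL is the optional silence inserted between words.
mkdir -p data/local/dict
printf 'SIL\nNSN\nSPN\n' > data/local/dict/silence_phones.txt
printf 'SIL\n'           > data/local/dict/optional_silence.txt
printf '!SIL SIL\n<INT> NSN\n<SPK> SPN\n' > data/local/dict/lexicon.txt
# prepare_lang then builds the lang directory from this, e.g.:
# utils/prepare_lang.sh data/local/dict "<SPK>" data/local/lang data/lang
cat data/local/dict/optional_silence.txt
```

The oov-word argument ("<SPK>" above) is illustrative; it just has to be some word in the lexicon that out-of-vocabulary words get mapped to.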
This can be done, of course, but wouldn't that affect LM scoring during decoding? I mean, there is no appropriate LM for e.g.:
<s> NINE <SPK> THREE <INT> </s>
It would be much better if these events could be excluded from LM scoring.
Since I believe that the only phone excluded from LM scoring is optional silence phone (correct me if I'm wrong), can you tell me what would be the easiest way to include SPN and NSN Gaussians into SIL phone? I can write my own module if there is no such one.
It sounds to me like this is a scoring issue. You can just decode those things as normal, but delete them before scoring. Some of the local/score.sh scripts do this already, e.g. the Switchboard or BABEL recipes.
This can be done, of course, but wouldn't that affect LM scoring during decoding? I mean, there is no appropriate LM for e.g.:
By scoring, I mean the process of going from the decoded output to a word error rate.
Language modeling is a separate thing.
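A minimal sketch of that pre-scoring filter (the token names <SPK>/<INT> and the "utt-id word word ..." text format are taken from this thread; real recipes do something equivalent inside local/score.sh before running compute-wer):

```shell
# Hypothetical pre-scoring filter: strip noise tokens from both the
# reference and the hypothesis so they are not counted as errors.
filter_noise() {
  sed -e 's/<SPK>//g; s/<INT>//g' -e 's/  */ /g' -e 's/ $//'
}
echo 'utt1 <SPK> NINE THREE <INT>' | filter_noise
# -> utt1 NINE THREE
# Kaldi's compute-wer would then be run on the cleaned files, e.g.:
# compute-wer --text ark:ref_filtered.txt ark:hyp_filtered.txt
```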
<s> NINE <SPK> THREE <INT> </s>
It would be much better if these events could be excluded from LM scoring.
Certainly <s> and </s> should never appear in the output, but their
probabilities should be included in the LM score if it's properly built
(they just mean begin and end of sentence).
I assumed that if you have those things appearing in your training
transcripts, you'd be able to build an LM on them. But if that's not the
case, it might be better to allow SIL, <SPK> and <INT> as options in the
lexicon, at least during testing. To do this you'd have to modify the perl
script that creates the lexicon FST.
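For illustration, the kind of change meant here is a sketch only: utils/make_lexicon_fst.pl is the script in question, and where it emits the optional-SIL arc one would also emit arcs for the other noise phones. In OpenFST text format (fields: src-state dst-state input-label output-label; the state numbers below are invented), the extra arcs would look like:

```shell
# Sketch of extra arcs in L.fst text format, modelled on the optional-
# silence arcs that utils/make_lexicon_fst.pl emits. State numbering
# is illustrative.
cat > extra_arcs.txt <<'EOF'
0 1 SIL <eps>
0 1 NSN <eps>
0 1 SPN <eps>
EOF
# These would be compiled together with the rest of the lexicon, e.g.:
# fstcompile --isymbols=phones.txt --osymbols=words.txt extra_arcs.txt
cat extra_arcs.txt
```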
Since I believe that the only phone excluded from LM scoring is optional silence phone (correct me if I'm wrong), can you tell me what would be the easiest way to include SPN and NSN Gaussians into SIL phone? I can write my own module if there is no such one.
Probably the best course of action is what I said above, but it looks like you don't have much background in this field so it may be hard for you. Go to www.openfst.org and do the FST tutorial there first.
By scoring, I mean the process of going from the decoded output to a word error rate.
Language modeling is a separate thing.
I understand that.
<s> NINE <SPK> THREE <INT> </s> It would be much better if these events could be excluded from LM scoring. Certainly <s> and </s> should never appear in the output but their probabilities should be included in the LM score if it's properly built. (they just mean begin and end of sentence).
I know about start and end of sentence. By "these events" I meant SPK and INT.
I assumed that if you have those things appearing in your training transcripts, you'd be able to build an LM on them. But if that's not the
case, it might be better to allow SIL, <SPK> and <INT> as options in the lexicon, at least during testing. To do this you'd have to modify the perl
script that creates the lexicon FST.
It can surely be done, but IMHO this is not something that the LM should handle, since these events don't have anything to do with language (they could be virtually anywhere). Making an LM from the training set (including SPK and INT) is easy, but it does not make much sense to me.
Another downside of the proposed approach is suboptimal LM scoring of initially highly probable sequences. For example:
I AM A LITTLE BOY.
Would be scored very high with any normal LM, right? But:
I <INT> AM A LITTLE <SPK> BOY.
Would be scored much lower, even if we introduced SPK and INT into the LM. Of course, if I had n-grams in the LM that cover, with high probability, the contexts that occur in this sentence, then it would still be scored high. But, as I said, this cannot be expected for SPK and INT, since they appear in utterances without any logic.
I still think that the best way to handle this would be to ignore SPK and INT during LM scoring in the decoder, but what you have proposed is certainly one way to deal with this problem.
I agree that in principle maybe it's not elegant to have these things in
the LM, but in practice I think it will work OK as long as you are training
your LM from the training transcripts. If you have other sources of LM
data, it may make sense to make them options in the lexicon.
I'm putting it on my TODO list to modify the scripts to support multiple
optional silences, if it turns out to be easily doable. (If anyone on the
list thinks they are capable of doing this, please contact me, but it's
probably best left to those who are already quite familiar with Kaldi.)
I see one way to support this in an (almost) optimal way, without any changes to code or even scripts: explicitly mark SPK, INT and similar events as !SIL words in the training set. This approach will implicitly make such acoustic events optional during decoding, and avoid their LM scoring. We will no longer have a model for "clean" silence, but I don't see the need for one, since we want all non-phoneme events to be optional and possible between any two words. The only downside (if even that) I see in this approach is that we won't have temporal modelling of the otherwise separate !SIL, SPK and INT models. But, in my experience, temporal modelling of such events is not of much importance.
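A minimal sketch of that relabelling step (the token names and the "utt-id word word ..." text format are taken from this thread; the sed mapping is just one way to do it):

```shell
# Map all non-speech event words in the training transcripts to !SIL,
# so that a single optional-silence model absorbs them all.
map_events() {
  sed -e 's/<SPK>/!SIL/g; s/<INT>/!SIL/g'
}
echo 'utt1 <SPK> NINE THREE <INT>' | map_events
# -> utt1 !SIL NINE THREE !SIL
```

In practice this would be run over each data/*/text file before training.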
BTW, it seems that people who have tested setups with more than one non-speech model have found that it indeed doesn't help much:
Whether or not multiple non-speech models improve a speech recognition system depends on the targeted application, the training data, and the pre-processing. In our experiments the gain in recognition quality from including more than one non-speech model (in addition to a silence model) is small, if any.