
WER Differences Between Online and Offline Setups


I was running the online setup based on the rm/s5 scripts for GMM systems. This is a custom database with more than 100 hours of speech. We do not have speaker information, so every utterance is treated as a new speaker.

The best WER I could achieve was
exp/tri2b_mmi_b0.05/decode_it4 %WER 8.46

However, the online models trained from the above models give
exp/tri2b_online/decode %WER 21.93

Is such a big difference between the online and offline setups expected?

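For reference, a minimal way to collect the numbers being compared here, assuming the usual egs/*/s5 layout where scoring has left wer_* files in each decode directory and utils/best_wer.sh is available in the recipe's utils/ directory:

# Print the best %WER line from each decode directory, so the offline
# and online results can be compared side by side.
for d in exp/tri2b_mmi_b0.05/decode_it4 exp/tri2b_online/decode; do
  grep WER $d/wer_* | utils/best_wer.sh
done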

I don't think RM is the best example to start with if you have more than 100 hours of speech.
In general, the online DNN setup is expected to perform (almost) as well as the offline DNN setup, and almost certainly significantly better than the SGMM setup.

 

I think the tri2b_online directory is online decoding using GMMs. I don't have a lot of experience with that setup, as we generally prefer the DNN-based setup; however, it's disappointing that there is so much degradation. We do expect some degradation, but not anywhere near that much. I would try to see if there is some kind of bug (e.g. some kind of configuration mismatch), or maybe a major mismatch between test and train conditions.
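In that spirit, a couple of quick sanity checks, as a sketch only; the exact file names under exp/tri2b_online are from memory and depend on how steps/online/prepare_online_decoding.sh laid out the directory in this run, so treat the paths as placeholders:

# 1. Compare the front-end configuration the online decoder was built with
#    against the offline one (sample rate, MFCC settings, CMVN options, ...).
cat conf/mfcc.conf
cat exp/tri2b_online/conf/*.conf

# 2. Make sure both decodes use the same decoding graph / language model;
#    decoding against a different HCLG.fst is an easy way to lose a lot of WER.
ls -l exp/tri2b/graph/HCLG.fst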

Thank you for your reply. Currently we do not have any powerful GPU computers, so we decided to stick with GMM systems. The non-GPU online nnet decoder was slower than real time for the LM we are using.

On the training side, I am using the local/online/run_gmm.sh script, with all occurrences of tri3b replaced with tri2b.

If you play with the number of threads in multithreaded BLAS implementations (e.g. ATLAS, maybe OpenBLAS), it should be possible to get it running in real time if the GMM one was real-time. I would guess two threads should be sufficient.
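As far as I know, OpenBLAS reads its thread count from the environment at run time (ATLAS usually fixes it at build time), so something along these lines is worth trying before giving up on the nnet decoder; the decode script invocation and paths below are placeholders taken from the standard online nnet2 recipes:

# Limit the BLAS thread pool; OPENBLAS_NUM_THREADS is OpenBLAS-specific,
# OMP_NUM_THREADS covers OpenMP-threaded BLAS builds.
export OPENBLAS_NUM_THREADS=2
export OMP_NUM_THREADS=2

# Then rerun the online decoding and check whether it is now faster than real time.
steps/online/nnet2/decode.sh --nj 4 exp/tri2b/graph data/test exp/nnet2_online/decode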

I spent some time exploring the online feature pipeline. The only difference seems to be the way CMVN is applied. The offline setup uses per-utterance CMVN, while the online setup does something different, based on an online estimate of the mean. It looks as if the mean is estimated from a window of past features, with some weighted smoothing applied based on the global stats, but I am a bit confused (my ability to follow the code is limited). Could you give me some insight into how the mean is estimated in the online setup?
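If it helps, here is roughly how the gap between the two CMVN variants could be quantified on the same features. This assumes the featbin tools apply-cmvn-online and compare-feats are available (they are in current Kaldi), and the path to the global CMVN stats written by the online preparation step is a guess:

# Offline variant: per-utterance CMVN, as used in the batch decode.
compute-cmvn-stats scp:data/test/feats.scp ark:cmvn_utt.ark
apply-cmvn ark:cmvn_utt.ark scp:data/test/feats.scp ark:feats_offline.ark

# Online variant: the mean is estimated causally from a window of past frames,
# smoothed with global stats near the start (the stats path is a guess).
apply-cmvn-online exp/tri2b_online/global_cmvn.stats \
  scp:data/test/feats.scp ark:feats_online.ark

# Rough similarity measure between the two feature streams.
compare-feats ark:feats_offline.ark ark:feats_online.ark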

Yes, the mean that is subtracted comes from a window of past features. The global and previous-utterance stats come into play near the beginning of the utterance, where you don't have enough history. The idea is to first back off to the previous utterances of the same speaker, and if those are not available (or too short), to back off to the global stats. Renata was asking the same question; you might want to share your insights with her if you have any. Online CMVN is necessarily different from batch CMVN, as we can't see the future.
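For completeness, this behaviour is governed by a handful of options on the online CMVN code (OnlineCmvnOptions in Kaldi), which the GMM online feature pipeline takes through a CMVN config file if I remember correctly. The option names and default values below are from memory, so check apply-cmvn-online --help before relying on them; a sketch of such a config:

cat <<'EOF' > conf/online_cmvn.conf
# estimate the mean over at most this many past frames
--cmn-window=600
# weight (in frames) given to stats from previous utterances of the same speaker
--speaker-frames=600
# weight (in frames) given to the global stats when there is little or no speaker history
--global-frames=200
EOF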
