First, thank you for your work. I have used the DNN-based online-decoding setup and obtained the expected results. Now I know it works, but I do not know why it works, especially the effect of the i-vector. So I have some questions about it; they may be very basic, but they are really important to me. Could you help me?
My questions are as follows:
1. Why is an i-vector used in the DNN-based online-decoding setup? What is its main effect?
2. When we use online2-wav-nnet2-latgen-faster to decode a wav file, how is the i-vector extracted online? Does every utterance use the same i-vector? If not, is an i-vector extracted every 10 frames, or on some other schedule?
The DNN models (from Dan's nnet2 setup) use the i-vectors to provide the neural network with the speaker identity. The input features are not speaker-normalized -- it's left to the network to figure this out.
During decoding, the trained i-vector extractor is used to estimate the i-vectors. They are extracted based on the spk2utt map parameter of online2-wav-nnet2-latgen-faster.
You can create various mappings (for example, you can treat each utterance as coming from a unique speaker, or just carry over the mapping from the data dir). The scripts steps/online/decode.sh and egs/rm/s5/local/online/run_nnet2.sh (for example) will hopefully answer your questions about how it is done.
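To make the two kinds of mapping concrete, here is a minimal Python sketch; the file paths are illustrative assumptions, not taken from the scripts above. It builds a per-utterance spk2utt (each utterance treated as its own speaker), as opposed to simply reusing the spk2utt that already exists in the data dir:

    # A minimal sketch, assuming a standard Kaldi data dir containing utt2spk
    # (all paths here are illustrative).
    # Option A: keep the speaker mapping from the data dir, i.e. use
    #           data/test/spk2utt as-is.
    # Option B: treat every utterance as its own "speaker", so that each
    #           utterance gets an i-vector estimated from that utterance alone.
    with open("data/test/utt2spk") as fin, \
         open("data/test/spk2utt_per_utt", "w") as fout:
        for line in fin:
            utt = line.split()[0]
            fout.write(f"{utt} {utt}\n")  # one line per utterance: "<utt-id> <utt-id>"

With the per-utterance map, no adaptation state is shared across utterances, which mirrors the case where you do not know which utterances belong to the same speaker.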
By the way, the i-vector is extracted every 10 frames during training, but the input to the computation is all frames of the same speaker that are prior to the current frame. This is to emulate the online test condition.
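As a rough illustration of that schedule (this is a toy sketch of the cadence only, not the actual Kaldi i-vector estimation), the snippet below recomputes a per-speaker statistic every 10 frames from all of that speaker's frames seen so far, with a running mean standing in for the real i-vector:

    import numpy as np

    IVECTOR_PERIOD = 10  # frames between successive re-estimations, as described above

    def online_ivector_schedule(frames):
        """Toy stand-in for online i-vector extraction: every IVECTOR_PERIOD
        frames, recompute an estimate from all of the speaker's frames seen
        so far (a running mean here, instead of the real i-vector math)."""
        estimates = []
        for t in range(0, len(frames), IVECTOR_PERIOD):
            history = frames[: t + 1]             # all frames up to the current frame
            estimates.append(history.mean(axis=0))
        return np.stack(estimates)

    # 100 frames of 13-dim features -> 10 estimates, one per 10 frames, each
    # computed from an ever-growing history, which mimics the online condition.
    feats = np.random.randn(100, 13)
    print(online_ivector_schedule(feats).shape)   # (10, 13)

The point is only that the estimate is refreshed periodically while its input grows with the amount of speech seen from that speaker.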
Thanks a lot! I want to know whether this approach suits the dialogue condition. When an utterance includes two or more speakers, is an i-vector extracted per speaker or per utterance? In other words, is speaker detection performed when the i-vector is extracted?
For dialogue, what you need is speaker diarization, not just speaker identification. Vimal and David (cc'd) are working on a speaker diarization setup for Kaldi, but it will most likely be a few months before it's ready.