I am using nnet2 to train a ReLU DNN, and QrStep often fails with "failed: KALDI_ISFINITE(x)".
For cross-entropy training I set the learning rate lower (0.001) than the one I use with sigmoid (0.01), and the failure did not happen again. But for discriminative training (--num-jobs-nnet 4) I set the learning rate to 0.000006 (a very, very small value, I think), and once training got past iteration 70, almost all 4 jobs failed.
Matrix operations may have a numerical stability problem.
I haven't gone deep into the QR code; I'm just reporting this.
Failure log:
nnet-combine-egs-discriminative ark:exp/fbank40_h5t2048_ReLU_degs/degs.$[((2-1+(90*4))%1071)+1].ark ark:- | nnet-train-discriminative-simple --silence-phones=1 --criterion=smbr --drop-frames=false --one-silence-class=true --boost=0.0 --acoustic-scale=0.1 --gpu-id=1 exp/fbank40_h5t2048_ReLU_NP_smbr_0.000006/90.mdl ark:- exp/fbank40_h5t2048_ReLU_NP_smbr_0.000006/91.2.mdl
Started at Tue Apr 7 10:36:43 CST 2015
nnet-combine-egs-discriminative ark:exp/fbank40_h5t2048_ReLU_degs/degs.362.ark ark:-
nnet-train-discriminative-simple --silence-phones=1 --criterion=smbr --drop-frames=false --one-silence-class=true --boost=0.0 --acoustic-scale=0.1 --gpu-id=1 exp/fbank40_h5t2048_ReLU_NP_smbr_0.000006/90.mdl ark:- exp/fbank40_h5t2048_ReLU_NP_smbr_0.000006/91.2.mdl
...LOG (nnet-train-discriminative-simple:GetScalingFactor():nnet-component.cc:1911) Limiting step size using scaling factor 0.836495, for component index 3
LOG (nnet-train-discriminative-simple:GetScalingFactor():nnet-component.cc:1911) Limiting step size using scaling factor 0.75548, for component index 3
LOG (nnet-train-discriminative-simple:GetScalingFactor():nnet-component.cc:1911) Limiting step size using scaling factor 0.660106, for component index 3
LOG (nnet-train-discriminative-simple:GetScalingFactor():nnet-component.cc:1911) Limiting step size using scaling factor 0.623963, for component index 5
LOG (nnet-train-discriminative-simple:GetScalingFactor():nnet-component.cc:1911) Limiting step size using scaling factor 0.226656, for component index 3
LOG (nnet-train-discriminative-simple:GetScalingFactor():nnet-component.cc:1911) Limiting step size using scaling factor 0.965409, for component index 13
LOG (nnet-train-discriminative-simple:GetScalingFactor():nnet-component.cc:1911) Limiting step size using scaling factor 0.63207, for component index 3
LOG (nnet-train-discriminative-simple:GetScalingFactor():nnet-component.cc:1911) Limiting step size using scaling factor 0.960904, for component index 3
LOG (nnet-train-discriminative-simple:GetScalingFactor():nnet-component.cc:1911) Limiting step size using scaling factor 0.632257, for component index 3
LOG (nnet-train-discriminative-simple:GetScalingFactor():nnet-component.cc:1911) Limiting step size using scaling factor 0.979007, for component index 13
KALDI_ASSERT: at nnet-train-discriminative-simple:QrStep:qr.cc:265, failed: KALDI_ISFINITE(x)
Stack trace is:
kaldi::KaldiGetStackTrace()
kaldi::KaldiAssertFailure_(char const, char const, int, char const)
void kaldi::QrStep<float>(int, float, float, kaldi::MatrixBase<float>)
void kaldi::QrInternal<float>(int, float, float, kaldi::MatrixBase<float>)
kaldi::SpMatrix<float>::Qr(kaldi::MatrixBase<float>)
.
.
.
kaldi::nnet2::NnetDiscriminativeUpdater::Update()
kaldi::nnet2::NnetDiscriminativeUpdate(kaldi::nnet2::AmNnet const&, kaldi::TransitionModel const&, kaldi::nnet2::NnetDiscriminativeUpdateOptions const&, kaldi::nnet2::DiscriminativeNnetExample const&, kaldi::nnet2::Nnet, kaldi::nnet2::NnetDiscriminativeStats)
nnet-train-discriminative-simple(main+0x50f) [0x65c03c]
/lib/x86_64-linux-gnu/libc.so.6(libc_start_main+0xf5) [0x7fe77c59eec5]
nnet-train-discriminative-simple() [0x65ba69]
WARNING (nnet-train-discriminative-simple:~Mutex():kaldi-mutex.cc:45) Error destroying pthread mutex; ignoring it as it could be a known issue that affects Haswell processors, see https://sourceware.org/bugzilla/show_bug.cgi?id=16657 If your processor is not Haswell and you see this message, it could be a bug in Kaldi. However it could be that multi-threaded code terminated messily.
KALDI_ASSERT: at nnet-train-discriminative-simple:QrStep:qr.cc:265, failed: KALDI_ISFINITE(x)
Stack trace is:
kaldi::KaldiGetStackTrace()
kaldi::KaldiAssertFailure_(char const, char const, int, char const)
void kaldi::QrStep<float>(int, float, float, kaldi::MatrixBase<float>)
void kaldi::QrInternal<float>(int, float, float, kaldi::MatrixBase<float>)
kaldi::SpMatrix<float>::Qr(kaldi::MatrixBase<float>)
.
.
.
kaldi::nnet2::NnetDiscriminativeUpdater::Update()
kaldi::nnet2::NnetDiscriminativeUpdate(kaldi::nnet2::AmNnet const&, kaldi::TransitionModel const&, kaldi::nnet2::NnetDiscriminativeUpdateOptions const&, kaldi::nnet2::DiscriminativeNnetExample const&, kaldi::nnet2::Nnet, kaldi::nnet2::NnetDiscriminativeStats)
nnet-train-discriminative-simple(main+0x50f) [0x65c03c]
/lib/x86_64-linux-gnu/libc.so.6(libc_start_main+0xf5) [0x7fe77c59eec5]
nnet-train-discriminative-simple() [0x65ba69]
Accounting: time=4 threads=1
Ended (code 255) at Tue Apr 7 10:36:47 CST 2015, elapsed time 4 seconds
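For readers following the log, the archive number in the first logged command (degs.362.ark) comes from the shell arithmetic in the ark: path. The sketch below reproduces that arithmetic in Python. The parameter names (job, iteration, num_jobs, num_archives) are my own reading of the constants 2, 90, 4 and 1071 in the expression, not names taken from the log, so treat them as assumptions.

```python
def archive_index(job, iteration, num_jobs, num_archives):
    """Mirror the shell expression $[((job-1+(iteration*num_jobs))%num_archives)+1].

    The argument names are an assumed interpretation of the constants
    that appear in the logged command.
    """
    return ((job - 1 + iteration * num_jobs) % num_archives) + 1


# Constants read off the logged command: 2, 90, 4 and 1071.
index = archive_index(job=2, iteration=90, num_jobs=4, num_archives=1071)
print(f"degs.{index}.ark")  # degs.362.ark, matching the expanded command in the log
```

The modulo makes the jobs cycle through the archives, so the index wraps back to 1 after the last archive has been used.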
This isn't a problem with matrix operations; it's a problem with
the instability of stochastic gradient descent when using nonlinearities with
unbounded outputs, such as ReLU. We normally solve this by putting a
NormalizeComponent after each ReLU component.
Actually, if your version of Kaldi is not up to date, it's possible that it is a problem with QR. But the QR code has been stable for a long time now.
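To illustrate why a normalization step after each ReLU helps, here is a minimal toy sketch in plain Python. It is not Kaldi's NormalizeComponent code: it assumes the component's effect is to rescale each activation vector to unit root-mean-square, and the "layer" (multiplying by 2.0) is a hypothetical stand-in for an affine transform with large weights.

```python
import math


def relu(xs):
    """Rectified linear unit: unbounded above, zero below."""
    return [max(0.0, x) for x in xs]


def rms(xs):
    """Root-mean-square of a vector."""
    return math.sqrt(sum(x * x for x in xs) / len(xs))


def normalize(xs, floor=1e-20):
    """Rescale a vector to unit RMS (assumed behaviour, not Kaldi's code).

    The floor (a hypothetical value) avoids division by zero for an
    all-zero input.
    """
    scale = 1.0 / max(rms(xs), floor)
    return [x * scale for x in xs]


def forward(xs, num_layers, use_normalize):
    """Toy stack: each layer doubles its input, then applies ReLU."""
    for _ in range(num_layers):
        xs = relu([2.0 * x for x in xs])
        if use_normalize:
            xs = normalize(xs)
    return xs


x = [0.5, 1.0, -2.0, 3.0]
print(rms(forward(x, 20, use_normalize=False)))  # grows exponentially with depth
print(rms(forward(x, 20, use_normalize=True)))   # stays close to 1
```

Without the normalization step the activation scale compounds from layer to layer, which is the kind of growth that can eventually produce non-finite values during training; with it, every layer sees inputs of a comparable scale.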
A learning rate of 0.000001 is OK, but with 0.000002 training failed. It is tricky.