train_multisplice_accel2.sh kernel crash

 

When I run this script, it causes the kernel to crash. I use the parameters as provided in the WSJ example:

steps/nnet2/train_multisplice_accel2.sh --stage $train_stage \
  --exit-stage $exit_train_stage \
  --num-epochs 8 --num-jobs-initial 2 --num-jobs-final 14 \
  --num-hidden-layers 4 \
  --splice-indexes "layer0/-1:0:1 layer1/-2:1 layer2/-4:2" \
  --feat-type raw \
  --online-ivector-dir $exp/nnet2_online/ivectors_train \
  --cmvn-opts "--norm-means=false --norm-vars=false" \
  --num-threads "$num_threads" --minibatch-size "$minibatch_size" \
  --parallel-opts "$parallel_opts" --io-opts "-tc 12" \
  --initial-effective-lrate 0.005 --final-effective-lrate 0.0005 \
  --cmd "$decode_cmd" \
  --pnorm-input-dim 2000 --pnorm-output-dim 250 --mix-up 12000 \
  $data/train_hires $lang $exp/tri4b $dir || exit 1;

The last message recorded by the GPU node is:

May 13 14:58:34 node001 kernel: nvidia 0000:0c:00.0: irq 93 for MSI/MSI-X
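
Not part of the original exchange, but a quick way to look for GPU-related kernel messages around a crash like this, assuming the node keeps a persistent systemd journal or rotated syslog files and has nvidia-smi installed:

# Kernel messages from the previous boot; NVRM/Xid lines usually pinpoint GPU faults.
journalctl -k -b -1 | grep -i -E 'nvrm|xid|nvidia'
# Fallback for nodes without journald: grep the archived kernel logs.
grep -i -E 'nvrm|xid|nvidia' /var/log/kern.log* /var/log/messages* 2>/dev/null
# While a training job is running, watch temperature and power draw for signs of overload.
nvidia-smi -q -d TEMPERATURE,POWER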

 

 

I don't think the message you've sent means anything -- you should investigate the training logs and search for errors -- you can find them in $dir/log.
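
A minimal sketch of that log check, assuming the experiment directory mentioned later in this thread (exp/nnet2_online/nnet_ms_a); adjust dir to your own setup:

# Scan every training log for errors, failures, or killed jobs.
dir=exp/nnet2_online/nnet_ms_a
grep -i -E 'error|fail|killed' $dir/log/*.log
# The per-iteration GPU jobs are typically logged as train.<iteration>.<job>.log.
ls $dir/log | head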

 

 

Can you send the full name of the log file? I'm not sure if you are looking into the right ones -- they should be located in exp/nnet2_online/nnet_ms_a/log/, and there should be more of them. Also, the paths look weird -- the scripts usually use relative paths, not absolute ones (/data/ac1ssf/wsj-pfstar/exp/nnet2_online/nnet_ms_a/*). Because of all of this, I tend to believe there is some problem with your setup, not really with the GPU.

 

I've attached all the log files here. The reason for the full paths is that I am using a shared computer with separate code and data spaces, so I've adapted the scripts slightly for this. I've run many SGMM and neural net experiments with Kaldi before without any problems -- this is the first time I've had a kernel crash, and it is with the new online nnet code.

 

I suggest running one or two of the training jobs and seeing what happens. What can sometimes happen on computers with insufficient power supplies is that when you load your GPUs heavily, the computer can die. If the computer just suddenly powers down or reboots, this is usually the reason. If you have just a computer with a few GPUs, then you shouldn't be using that setup with --num-jobs-initial 2 and --num-jobs-final 14. Instead, set both values to the number of GPUs you have (e.g. 1 or 2) and set the GPUs to exclusive mode with nvidia-smi -c 1. You might want to reduce the number of epochs a bit if it would take too long.
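
Not from the thread itself, but a minimal sketch of that adjustment for a machine with two GPUs, reusing the command from the first post; the num_gpus value and the reduced epoch count are illustrative assumptions:

# Put the GPUs into exclusive compute mode so each training job owns one GPU (needs root).
sudo nvidia-smi -c 1

# Keep the number of parallel jobs constant and equal to the number of GPUs present.
num_gpus=2
steps/nnet2/train_multisplice_accel2.sh --stage $train_stage \
  --exit-stage $exit_train_stage \
  --num-epochs 6 --num-jobs-initial $num_gpus --num-jobs-final $num_gpus \
  --num-hidden-layers 4 \
  --splice-indexes "layer0/-1:0:1 layer1/-2:1 layer2/-4:2" \
  --feat-type raw \
  --online-ivector-dir $exp/nnet2_online/ivectors_train \
  --cmvn-opts "--norm-means=false --norm-vars=false" \
  --num-threads "$num_threads" --minibatch-size "$minibatch_size" \
  --parallel-opts "$parallel_opts" --io-opts "-tc 12" \
  --initial-effective-lrate 0.005 --final-effective-lrate 0.0005 \
  --cmd "$decode_cmd" \
  --pnorm-input-dim 2000 --pnorm-output-dim 250 --mix-up 12000 \
  $data/train_hires $lang $exp/tri4b $dir || exit 1;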
