Configuring a GPU in a virtual machine to run a DNN

Question

Hi! I am trying to create a DNN model following the run_dnn.sh script. I have created the fMLLR features. After that I ran this command:

steps/nnet/pretrain_dbn.sh --rbm-iter 3 $data_fmllr/train $dir
The training set contains 73 hours of Voxforge data. I'm using Mint 17 in VirtualBox, which I think is the main reason the script has trained only 3 layers after a week of computing. I had no such problems when building models with the Voxforge recipe. I tried changing the parameters in the script to use the CPU instead of the GPU:
nnet-forward --use-gpu=no
But it is still too slow. How can I speed up the computation, and where should I look for a mistake?

Answer

If you don't have GPUs, there is not much point in trying this; it will be too slow. The nnet2 setup is faster if you have a lot of CPUs, since it supports multi-threaded and multi-machine training, but it is still best to have GPUs.

Answer

There are some tricks for getting the GPU (if you have one) into the virtual machine. It wasn't straightforward the last time I checked, but it was possible. Perhaps someone on the list who runs a GPU in a virtual machine could give some advice?

Now I am trying to run run_5d.sh from nnet2. To use the CPU, I changed the parameters in cmd.sh (train_cmd = run.pl, decode_cmd = run.pl). Everything is OK, but my virtual machine crashes when njobs is more than 4 at the get-egs stage (heavy disk load). After all the previous steps had been performed with 4 jobs, I tried to rerun train_pnorm_fast with train_stage = -2 and njobs = 8, but got an error: "run.pl: 4/8 failed", and the log file says "Error constructing table reader". So, can I use more jobs for the training stage?
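For readers following along, the cmd.sh change described above might look roughly like this. It is a minimal sketch that assumes the recipe's cmd.sh exports the two variables named in the post; other variables a given recipe may define are not shown.

```shell
# cmd.sh (sketch): run every job on the local machine rather than on a grid queue.
# run.pl is the local job runner found in the recipe's utils/ directory.
export train_cmd=run.pl
export decode_cmd=run.pl
```

With these settings all parallel jobs share the virtual machine's CPU and disk, which matches the heavy disk load reported at the get-egs stage.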

That script is somewhat out of date. train_pnorm_simple2.sh will spend less time dumping egs. But make sure your Kaldi is up to date. By default, get_egs.sh passes the option "-tc 5" while getting the egs, so that no more than 5 jobs run at once, but run.pl has only recently started supporting that option (previously it was ignored).

After svn update I have another problem.

The command "make depend -j 8" succeeds, but at the make stage I get errors:

/home/pittman/kaldi-trunk/src/gmmbin/gmm-align.cc:136: undefined reference to `kaldi::AlignUtteranceWrapper(kaldi::AlignConfig const&, std::string const&, float, fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> > >, kaldi::DecodableInterface, kaldi::TableWriter<kaldi::BasicVectorHolder<int> >, kaldi::TableWriter<kaldi::BasicHolder<float> >, int, int, int, double, long*)'

                collect2: error: ld returned 1 exit status

sgmm-align-compiled.o: In function `main':
/home/pittman/kaldi-trunk/src/sgmmbin/sgmm-align-compiled.cc:164: undefined reference to `kaldi::AlignUtteranceWrapper(kaldi::AlignConfig const&, std::string const&, float, fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> > >, kaldi::DecodableInterface, kaldi::TableWriter<kaldi::BasicVectorHolder<int> >, kaldi::TableWriter<kaldi::BasicHolder<float> >, int, int, int, double, long*)'

                collect2: error: ld returned 1 exit status

Is this caused by the gcc version? (gcc version 4.8.2 (Ubuntu 4.8.2-19ubuntu1))

Did you run "make clean" after the svn update, or after "make depend"?
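The rebuild order being asked about can be sketched as a small helper. This is only an illustration under assumptions: the function name clean_rebuild and its arguments are hypothetical, and the default of 8 parallel jobs simply mirrors the "-j 8" used earlier in the thread.

```shell
# Hypothetical helper: clean rebuild of a source tree in the order
# make clean -> make depend -> make.
# $1: path to the src/ directory, $2: number of parallel jobs (default 8).
clean_rebuild() {
  src_dir="$1"
  jobs="${2:-8}"
  # Refuse to run if the directory does not exist.
  [ -d "$src_dir" ] || { echo "no such directory: $src_dir" >&2; return 1; }
  # Run in a subshell so the caller's working directory is unchanged.
  ( cd "$src_dir" && make clean && make depend -j "$jobs" && make -j "$jobs" )
}
```

It could be called as `clean_rebuild /home/pittman/kaldi-trunk/src 8`, using the path that appears in the error messages above.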

Sorry, I jumped to conclusions. I did run "make clean" before "make depend", and I still get:

align-mapped.o: In function `main':
/home/pittman/kaldi-trunk/src/bin/align-mapped.cc:129: undefined reference to `kaldi::AlignUtteranceWrapper(kaldi::AlignConfig const&, std::string const&, float, fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> > >, kaldi::DecodableInterface, kaldi::TableWriter<kaldi::BasicVectorHolder<int> >, kaldi::TableWriter<kaldi::BasicHolder<float> >, int, int, int, double, long*)'

                collect2: error: ld returned 1 exit status

That's odd. The symbol should be defined in decoder/kaldi-decoder.a; it comes from decoder-wrappers.cc in that directory. Try running "make" in decoder/, then "make" in bin/, and make sure decoder/kaldi-decoder.a is on the link line. I don't understand what's happening.
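One way to check whether the library really defines the symbol is to list its symbols with nm. The sketch below is an assumption-laden illustration, not part of the original advice: the helper name has_defined_symbol is hypothetical, and it relies on GNU nm's -C (demangle) and --defined-only options.

```shell
# Hypothetical helper: succeed if the archive defines a symbol whose
# demangled name contains the given text.
# $1: path to a static library (.a), $2: text to search for.
has_defined_symbol() {
  archive="$1"
  symbol="$2"
  # -C demangles C++ names; --defined-only skips undefined references.
  nm -C --defined-only "$archive" 2>/dev/null | grep -q "$symbol"
}
```

After running "make" in decoder/, a call such as `has_defined_symbol decoder/kaldi-decoder.a AlignUtteranceWrapper && echo defined` from the src/ directory would show whether the archive was rebuilt with the wrapper. If it was, the remaining suspect is the link line used in bin/.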
