
Out-of-memory error when training a model on the GPU


Large-scale development calls for a properly specced environment; otherwise strange failures appear.

I am encountering an error while running RBM pre-training using nnet/pre-train_dbn.sh. I have a GPU card installed with the following specs:

| NVIDIA-SMI 340.29 Driver Version: 340.29 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 GeForce GTX 650 Off | 0000:01:00.0 N/A | N/A |
| 10% 34C P8 N/A / N/A | 440MiB / 1023MiB | N/A Default |
+-------------------------------+----------------------+----------------------+
The first layer of pre-training was fine and ran without any error, but while running the second layer it gives me an out-of-memory error again and again. Any suggestions would be helpful. I am enclosing the log rbm.2.log, where the script exits.

rbm-train-cd1-frmshuff --learn-rate=0.4 --l2-penalty=0.0002 --num-iters=8 --drop-data=0.0 --verbose=1 '--feature-transform=nnet-concat exp//tr_splice5-1_cmvn-g.nnet exp//1.dbn - |' exp//2.rbm.init 'ark:copy-feats scp:exp//train.scp ark:- |' exp//2.rbm
LOG (rbm-train-cd1-frmshuff:SelectGpuIdAuto():cu-device.cc:242) Selecting from 1 GPUs
LOG (rbm-train-cd1-frmshuff:SelectGpuIdAuto():cu-device.cc:257) cudaSetDevice(0): GeForce GTX 650 free:628M, used:395M, total:1023M, free/total:0.613556
LOG (rbm-train-cd1-frmshuff:SelectGpuIdAuto():cu-device.cc:290) Selected device: 0 (automatically)
LOG (rbm-train-cd1-frmshuff:FinalizeActiveGpu():cu-device.cc:174) The active GPU is [0]: GeForce GTX 650 free:613M, used:410M, total:1023M, free/total:0.598903 version 3.0
LOG (rbm-train-cd1-frmshuff:PrintMemoryUsage():cu-device.cc:314) Memory used: 0 bytes.
LOG (rbm-train-cd1-frmshuff:DisableCaching():cu-device.cc:683) Disabling caching of GPU memory.
nnet-concat exp/SpeechEnh_pretrain-dbn/tr_splice5-1_cmvn-g.nnet exp//1.dbn -
LOG (nnet-concat:main():nnet-concat.cc:53) Reading exp//tr_splice5-1_cmvn-g.nnet
LOG (nnet-concat:main():nnet-concat.cc:65) Concatenating exp//1.dbn
LOG (nnet-concat:main():nnet-concat.cc:82) Written model to -
copy-feats scp:exp//train.scp ark:-
LOG (rbm-train-cd1-frmshuff:Init():nnet-randomizer.cc:31) Seeding by srand with : 777
LOG (rbm-train-cd1-frmshuff:main():rbm-train-cd1-frmshuff.cc:141) RBM TRAINING STARTED
LOG (rbm-train-cd1-frmshuff:main():rbm-train-cd1-frmshuff.cc:144) Iteration 1/8
VLOG[1] (rbm-train-cd1-frmshuff:main():rbm-train-cd1-frmshuff.cc:250) Setting momentum 0.9 and learning rate 0.2 after processing 0.000138889h
VLOG[1] (rbm-train-cd1-frmshuff:main():rbm-train-cd1-frmshuff.cc:250) Setting momentum 0.9 and learning rate 0.196 after processing 1.38889h
VLOG[1] (rbm-train-cd1-frmshuff:main():rbm-train-cd1-frmshuff.cc:250) Setting momentum 0.9 and learning rate 0.192 after processing 2.77778h
VLOG[1] (rbm-train-cd1-frmshuff:Eval():nnet-loss.cc:244) ProgressLoss[2h/2h]: 2.04231 (Mse)
WARNING (rbm-train-cd1-frmshuff:MallocInternal():cu-device.cc:630) Allocation of 33784 rows, each of size 8192 bytes failed, releasing cached memory and retrying.
WARNING (rbm-train-cd1-frmshuff:MallocInternal():cu-device.cc:637) Allocation failed for the second time. Printing device memory usage and exiting
LOG (rbm-train-cd1-frmshuff:PrintMemoryUsage():cu-device.cc:314) Memory used: 590651392 bytes.
ERROR (rbm-train-cd1-frmshuff:MallocInternal():cu-device.cc:640) Memory allocation failure
WARNING (rbm-train-cd1-frmshuff:Close():kaldi-io.cc:446) Pipe copy-feats scp:exp//train.scp ark:- | had nonzero return status 36096
ERROR (rbm-train-cd1-frmshuff:MallocInternal():cu-device.cc:640) Memory allocation failure

[stack trace: ]
kaldi::KaldiGetStackTrace()
kaldi::KaldiErrorMessage::~KaldiErrorMessage()
kaldi::CuAllocator::MallocInternal(unsigned int, unsigned int, unsigned int)
kaldi::CuAllocator::MallocPitch(unsigned int, unsigned int, unsigned int*)
kaldi::CuDevice::MallocPitch(unsigned int, unsigned int, unsigned int*)
.
.
.
kaldi::CuMatrix<float>::CuMatrix(kaldi::CuMatrix<float> const&, kaldi::MatrixTransposeType)
kaldi::nnet1::MatrixRandomizer::AddData(kaldi::CuMatrixBase<float> const&)
rbm-train-cd1-frmshuff(main+0x118a) [0x80d378d]
/lib/i386-linux-gnu/libc.so.6(__libc_start_main+0xf3) [0xb36a3a83]
rbm-train-cd1-frmshuff() [0x80d2561]


This seems to be a GPU memory error. Is a 1 GB GPU card not enough to run Kaldi? I tried reducing the batch size, but it didn't help. If I run it on the CPU, it works fine, but it takes a lot of time, which I want to reduce.
It looks like your GPU card has only 1 GB of memory. This may not be enough for the model size that you are using. It says:
Allocation of 33784 rows, each of size 8192 bytes failed, releasing cached memory and retrying.

33784 seems a strange number of rows. Perhaps you have 33784 leaves in your tree?
That's only a couple hundred megabytes, but sometimes several copies of things like that are needed (e.g. for the model, the gradient, the minibatch).
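As a rough sanity check, the failed allocation's size can be computed directly from the numbers in the log. This is a back-of-the-envelope sketch, not Kaldi code; the number of simultaneous copies is an assumption for illustration:

```python
# Back-of-the-envelope estimate of the failed GPU allocation,
# using the row count and row size reported in the log above.

rows = 33784          # rows in the failed allocation (from the log)
row_bytes = 8192      # bytes per row (from the log)

one_buffer_mib = rows * row_bytes / (1024 ** 2)
print(f"one buffer: {one_buffer_mib:.0f} MiB")

# Several copies of buffers this size may be live at once
# (e.g. model, gradient, minibatch); the count here is assumed.
copies = 3
total_mib = one_buffer_mib * copies
print(f"{copies} copies: {total_mib:.0f} MiB")

free_mib = 613        # free GPU memory reported by cu-device.cc above
print("fits in free GPU memory:", total_mib < free_mib)
```

A single buffer is about 264 MiB, so even two or three such buffers overrun the ~613 MiB the log shows as free on this 1 GB card.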
Would it work if I reduce the NNET model configuration? Presently I am using 3 layers of 2048 nodes. If I reduce it to 1024, will it work, or will I face problems with that configuration as I go ahead (in future layers)?
Yes, it worked once I reduced the network size from 2048 to 1024. The GPU has to be upgraded if I need to use more nodes.
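This outcome is consistent with the log: the stack trace ends in MatrixRandomizer::AddData, and the failed buffer's row size of 8192 bytes equals 2048 float32 values, i.e. one frame of 2048-dimensional layer output. Interpreting it that way (an assumption based on those numbers, not confirmed in the thread), halving the layer width halves each shuffling buffer:

```python
# If each 8192-byte row is one frame of float32 layer output
# (8192 bytes / 4 bytes = 2048 dims, matching the layer width),
# the randomizer buffer shrinks in proportion to the width.

frames = 33784            # frames in the buffer (from the log)
bytes_per_float = 4       # single precision, as used on the GPU

for width in (2048, 1024):
    mib = frames * width * bytes_per_float / (1024 ** 2)
    print(f"width {width}: {mib:.0f} MiB per randomizer buffer")
```

At width 2048 the buffer is about 264 MiB, matching the failed allocation; at width 1024 it drops to about 132 MiB, which explains why the smaller network fits on the 1 GB card.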
