In order to train a Chinese AM model for ASR, we want to set up a DNN AM training platform based on Kaldi. We want to use GPUs, and we currently have 2000 hours of speech data. How many GPUs would we probably need to keep the training time acceptable (possibly one week)?
According to 'http://kaldi.sourceforge.net/dnn.html', Dan's example scripts support using multiple GPUs. How do we set up a multi-GPU training platform? Is there specific documentation?
Most of the example scripts use 4 or 6 GPUs. You could probably get away with 2, but I'd recommend getting 4.
You will need to install GridEngine if the GPUs are on multiple machines. On Debian it's easy, as it's in a package (gridengine-master for the master, and gridengine-exec for the others, IIRC). In the 'kluster' project on Sourceforge there are some notes on the way I normally configure it, e.g. adding the 'ram_free' resource, but that is not 100% necessary. The main thing is to add the 'gpu' resource so GridEngine can keep track of how many GPUs you have on your cluster. Basically you do 'qconf -mc' and add a line like this:
gpu    g    INT    <=    YES    YES    0    0
and then if you have execution hosts foo1 and foo2 in your queue, each with two GPUs, you would do
qconf -me foo1
and add ",gpu=2"
and likewise for foo2.
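Put together, the GridEngine side looks roughly like this (a sketch only; the package names are the Debian ones mentioned above, and the complex_values field is an assumption about where the ",gpu=2" entry goes):

# On the master node:
apt-get install gridengine-master
# On each execution host (foo1, foo2):
apt-get install gridengine-exec

# Register the consumable 'gpu' resource (opens an editor; add the line shown above):
qconf -mc

# Record how many GPUs each execution host has; append ",gpu=2" to the
# complex_values line of each host (assumed field name):
qconf -me foo1
qconf -me foo2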
The training time for 2000 hours of data with 4 GPUs depends on the number of epochs and the model size, but if you use --num-epochs 3 --num-epochs-final 1, I would guess it would take about 48 to 72 hours. Use train_pnorm_fast.sh, not train_pnorm.sh, which is slower.
You need to have NFS set up so different hosts can see each other's disks. Ideally you'll have multiple hosts exporting their disks over NFS, and you can then stripe the data over disks using utils/create_split_dir.sh.
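For reference, the training call being described would look something like this (a sketch only; the data, lang and alignment directories are placeholders, and setting --num-jobs-nnet to roughly one job per GPU is an assumption, not something stated in this thread):

steps/nnet2/train_pnorm_fast.sh \
  --num-epochs 3 --num-epochs-final 1 \
  --num-jobs-nnet 4 \
  --cmd "$train_cmd" \
  data/train data/lang exp/tri4b_ali exp/nnet2_pnorm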
You need at least gigabit ethernet, preferably more, and make sure you have a good network topology so there aren't bottlenecks. (At Hopkins we have machines all connected via gigabit ethernet to a single large switch; if we had two switches connected to each other by a gigabit link, there would be a problem.)
All of this needs some familiarity with UNIX system administration, I'm afraid. If you don't have someone in your group who knows this stuff, you'll be in trouble.
Could you tell me more about your DNN/CUDA environment setup? For example, what kind of CPUs, GPUs, and machines do you have? I would like to know how to finish DNN training (2000 hours of data) in about 48-72 hours.
I now have two GTX 780Ti cards (with SLI) in an i7-4930 machine, but it is somehow not fast enough. Should I go for a GTX Titan (or other more expensive cards) for more double-precision compute power? In other words, does the DNN code in Kaldi make heavy use of double precision?
Thanks for your help and have a nice day!
We have Tesla K10s (IIRC, each card appears as 2 GPUs, so this counts as 2 GPUs). You can check online how the flops compare with the GTX 780Ti cards. It's single-precision flops you have to be concerned about; the DNN training does not use double-precision operations.
I think when I spoke of running 2000 hours of data in 48 hours, I probably meant using 6 or 8 GPUs, since that's the normal number of GPUs we use to train on that amount of data. So if you just have 2 GPUs, it would take 3 or 4 times longer.
Of course it's possible you are limited by network bandwidth, e.g. this is likely if you are accessing your data over NFS and you don't have 10G ethernet or better. [If you do "top" and the training process is using less than 100% CPU, this is likely.]
Or it's possible that you did not run "configure" on a machine with 'nvcc' on the path, so it's not using the GPU at all. If it is using the GPU, it should tell you near the top of the log file which GPU it's using (0 or 1) and how much memory it has. If it says nothing like this, then likely you haven't compiled for GPU use, or you are invoking the program with '--use-gpu=no'. You can tell from the exit status of the command 'cuda-compiled' whether you have compiled for GPU use: 0 -> compiled for GPU, 1 -> not compiled for GPU.
nvidia-smi will tell you GPU utilization, which should be at least greater than 50% if it's working correctly.
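In script form, those checks come down to something like this (a sketch; the log path is a placeholder, not from this thread):

# Exit status of cuda-compiled: 0 -> compiled for GPU, 1 -> not.
if cuda-compiled; then
  echo "compiled for GPU use"
else
  echo "NOT compiled for GPU use; re-run configure with nvcc on the PATH"
fi

# Check the top of a training log to see which GPU was selected (placeholder path):
head -n 20 exp/nnet2_pnorm/log/train.0.1.log | grep -i gpu

# Watch GPU utilization (should be at least greater than 50% while training):
nvidia-smi -l 5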
Thanks a lot! You even worked hard over the weekend.
I will try to build a cluster (with four 780Ti cards) this week to speed up the DNN training procedure. Could I ask three more questions?
1. How long will it take in your case to finish the example script "egs/wsj/s5/local/run_nnet2.sh"? And how about the time to run just the primary recipe "local/nnet2/run_5d.sh"?
2. Since most of the example scripts use 4 or 6 GPUs and I now have only two GPUs, should I modify the example scripts?
3. Should I set the GPUs to process-exclusive or even thread-exclusive mode?
You could probably run the neural-net training script using run.pl as the $cmd variable if you have just one server with two GPUs. In this case there is not much setting up to do. Make sure your GPUs are not the kind with just one or two gigs of RAM.
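Concretely, the cmd.sh for that single-server case would be something like the following (a sketch, following the usual egs cmd.sh convention):

# Run all jobs locally; the nnet2 training script will pick up the local GPUs.
export train_cmd=run.pl
export decode_cmd=run.pl
# On a GridEngine cluster you would use queue.pl here instead, so that the
# 'gpu' resource keeps track of which GPUs are in use.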