When we were training a p-norm DNN using steps/nnet2/train_multisplice_accel2.sh, the training server lost power. The program had been running for a few days. Is there a way to make the program continue from where it left off? If so, how?
Yes, there is. Most of the scripts have a "stage" parameter, which you can use to resume from a certain point.
In your case the stage corresponds to the DNN training iteration. If you look in the target directory you will see a bunch of
<some_number>.mdl files; the largest number + 1 is your stage.
So you run the same script with the same parameters and add --stage <stage no> as an additional parameter.
If, for some reason, the last file is damaged, just ignore that one and use the number from the file.
Actually, I think the stage should be the largest .mdl number, not the
largest number + 1.
Now, the largest number is 863.mdl. Should I set the --stage value to 864 to make the program continue?
Correction:
Should I set the --stage value to 863 to make the program continue?
Yes.
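
For anyone who wants to look up that number automatically, here is a minimal sketch in Python, not part of Kaldi. The helper name latest_stage is hypothetical, and it assumes the iteration models are named <some_number>.mdl in the experiment directory. Treating a zero-byte file as damaged is also only an assumption: a file truncated by the power loss can still be non-empty, so it is worth checking the newest model by hand.

```python
import re
from pathlib import Path
from typing import Optional

# Matches iteration models such as "863.mdl"; names like "final.mdl" are skipped.
_ITER_MODEL = re.compile(r"^(\d+)\.mdl$")


def latest_stage(exp_dir: str) -> Optional[int]:
    """Return the largest N for which a non-empty N.mdl exists in exp_dir.

    Hypothetical helper: returns None if no numbered model is found.
    """
    best = None
    for path in Path(exp_dir).iterdir():
        match = _ITER_MODEL.match(path.name)
        if match is None or not path.is_file():
            continue
        # Assumption: an empty file was cut off mid-write and is ignored.
        if path.stat().st_size == 0:
            continue
        number = int(match.group(1))
        if best is None or number > best:
            best = number
    return best
```

If the newest intact model in the directory is 863.mdl, latest_stage returns 863, which is the value to pass as --stage when rerunning the same script with the same parameters.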