Missing word in the lexicon


My scripts run smoothly on a MacBook, but when I run the same script on an Amazon server it shows this error:

steps/train_mono.sh: Initializing monophone system.
steps/train_mono.sh: Compiling training graphs
bash: line 1: 10155 Aborted (core dumped) ( compile-train-graphs exp/mono0a/tree exp/mono0a/0.mdl data/lang/L.fst "ark:sym2int.pl --map-oov 5 -f 2- data/lang/words.txt < data/train/split1/1/text|" "ark:|gzip -c >exp/mono0a/fsts.1.gz" ) 2>> exp/mono0a/log/compile_graphs.1.log >> exp/mono0a/log/compile_graphs.1.log
run.pl: job failed, log is in exp/mono0a/log/compile_graphs.1.log

And the log file shows:

sym2int.pl: replacing (女性)《 with 5
sym2int.pl: replacing ! with 5
sym2int.pl: replacing 》 with 5
sym2int.pl: replacing (女性たち)《 with 5
sym2int.pl: replacing ! with 5
sym2int.pl: replacing 》 with 5
sym2int.pl: replacing 《 with 5
sym2int.pl: replacing ろうぜき with 5
sym2int.pl: not warning for OOVs any more times
KALDI_ASSERT: at compile-train-graphs:CompileGraphs:training-graph-compiler.cc:194, failed: phone2word_fst.Start() != kNoStateId && "Perhaps you have words missing in your lexicon?"
Stack trace is:
kaldi::KaldiGetStackTrace()
kaldi::KaldiAssertFailure_(char const*, char const*, int, char const*)
kaldi::TrainingGraphCompiler::CompileGraphs(std::vector<fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> > > const*, std::allocator<fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> > > const*> > const&, std::vector<fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> > >*, std::allocator<fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> > >*> >*)
kaldi::TrainingGraphCompiler::CompileGraphsFromText(std::vector<std::vector<int, std::allocator<int> >, std::allocator<std::vector<int, std::allocator<int> > > > const&, std::vector<fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> > >*, std::allocator<fst::VectorFst<fst::ArcTpl<fst::TropicalWeightTpl<float> > >*> >*)
compile-train-graphs(main+0x7d0) [0x5b29aa]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf5) [0x7f2d9368dec5]
compile-train-graphs() [0x5b2112]
* Replaced 274 instances of OOVs with 5
# Accounting: time=0 threads=1
# Ended (code 34304) at Mon Dec 8 09:06:22 UTC 2014, elapsed time 0 seconds

So my questions are:

(1) What is the meaning of the above error? Does it mean I need to modify the dict or the LM?

(2) Why does it run smoothly on the Mac but not on Ubuntu?

It might be something with the locales, or you didn't notice some failure earlier.
Anyway, check whether the file data/lang/words.txt exists and how big it is, and the same for data/lang/L.fst.
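
For example, a quick way to do that check from the recipe directory (a sketch; fstinfo is only needed if you want more than the file sizes):

ls -l data/lang/words.txt data/lang/L.fst   # do both files exist, and how big are they?
wc -l data/lang/words.txt                   # number of entries in the word table
fstinfo data/lang/L.fst | head              # optional: basic statistics of the lexicon FST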

 

Also you could compare your data/lang/ directory with what you have on your Mac to see where the difference started, and try running validate_lang.pl. Likely you have a word in your words.txt (and your transcripts) that is not in your lexicon (L.fst), but I don't know why that would be. If, as Yenda suggests, it turns out to be locale-related (e.g. it can be fixed by adding LC_ALL=C in your .bashrc or .bash_profile), please try to figure out why and let us know. The scripts are supposed to ignore the user's locale, e.g. we export LC_ALL=C before doing "sort".
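
As a sketch of those suggestions (the path to the Mac copy, /path/from/mac/lang, is hypothetical):

diff -r data/lang /path/from/mac/lang   # where do the two lang directories diverge?
utils/validate_lang.pl data/lang        # consistency checks on words.txt, L.fst, etc.
export LC_ALL=C                         # Yenda's locale suggestion, e.g. in ~/.bashrc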


The words.txt (in data/lang and data/lang_test) is different, while all other files (L.fst, G.fst, etc.) are identical. If I copy the Mac's words.txt to the server and restart at stage 2 (mono), it runs smoothly. So the problem comes from the generation of words.txt; could you tell me which scripts (in the Tedlium example) generate this words.txt?
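
One way to track that down, assuming the usual s5 recipe layout (words.txt is ultimately written by utils/prepare_lang.sh, as the snippet further down shows), is simply to grep the recipe for the file name:

grep -rn "words.txt" run.sh local/ utils/prepare_lang.sh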

More precisely, on the server, words.txt ends with

 #0 3229
 <s> 3230
 </s> 3231  

On the Mac, words.txt does not have those two symbols and just ends with

 #0 3229  

I checked prepare_lang.sh and have no idea where the extra two lines come from, or why this only happens on the server.

Guoguo Chen changed the code to:

cat $tmpdir/lexiconp.txt | awk '{print $1}' | sort | uniq | awk '
  BEGIN {
    print "<eps> 0";
  }
  {
    if ($1 == "<s>") {
      print "<s> is in the vocabulary!" > "/dev/stderr"
      exit 1;
    }
    if ($1 == "</s>") {
      print "</s> is in the vocabulary!" > "/dev/stderr"
      exit 1;
    }
    printf("%s %d\n", $1, NR);
  }
  END {
    printf("#0 %d\n", NR+1);
    printf("<s> %d\n", NR+2);
    printf("</s> %d\n", NR+3);
  }' > $dir/words.txt || exit 1;

I was about to ask whether your Mac version was out of date after looking at your missing integers. The two extra lines were added for a language-model rescoring capability several months ago. You can manually add the two lines to your words.txt.
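
For example, assuming the word table ends with "#0 3229" as in the Mac copy shown above, the two entries could be appended by hand:

echo "<s> 3230"  >> data/lang/words.txt
echo "</s> 3231" >> data/lang/words.txt   # repeat for data/lang_test/words.txt if that copy also lacks them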

OK, do "svn up" and try running validate_data_dir.sh now; if it gives an error then you'll have to re-do your data preparation. If not we
have to keep debugging.好,执行“ svn up”并尝试立即运行validate_data_dir.sh;如果出现错误,则必须重新进行数据准备。如果没有,我们
必须继续调试。
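
As a sketch, assuming a kaldi-trunk checkout and the Tedlium s5 recipe directory:

svn up                                  # update the Kaldi checkout (run from the checkout root)
utils/validate_data_dir.sh data/train   # re-check the training data directory (run from egs/tedlium/s5)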

I mean you can manually add the two lines if you don't want to redo anything. You should probably try things on the same version.

I think I have an idea what might be the problem.
We recently added the BOS symbol ("<s>") and the EOS symbol ("</s>") to the lexicon. These are not supposed to appear in your transcripts, and previously, if they did appear in your transcripts, they would have automatically been replaced by your designated unknown-word (usually "<UNK>") by sym2int.pl. However, now that these words are in words.txt, they will be replaced with their own symbols by sym2int.pl, but this leads to an error because they are not in the lexicon FSTs (L.fst and L_disambig.fst).
Guoguo, I think the easiest fix for this is that you update validate_data_dir.sh to make it an error if these symbols appear in the text. While you're at it, you could also make it an error if #0 appears in the text. While this won't lead to a crash, it might lead to unexpected behavior at the very least.
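
A quick way to check whether these symbols have crept into the transcripts, as a sketch (assuming the usual data/train/text layout):

grep -E "<s>|</s>|#0" data/train/text | head   # any hit means the transcripts contain reserved symbols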
