Hi All,
I am trying to train and use a model for English from scratch on version 0.5.1. My aim is to train two models, one with and one without a language model. I would appreciate your help on several fronts. Sorry this is long, but I am trying to be as detailed as possible; also, being new to Linux and data science, I may be stating some very obvious things.
Thank you in advance for your help.
Part A) My Questions
Part B) Background info
Regards,
Rohit
Part A) My Questions
A1) When using a language model, either for training or inference, do I HAVE to specify the lm_binary parameter AND the corresponding trie file? Can specifying only the lm_binary, or only the trie, work?
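To make the question concrete, here are the two invocations I mean (paths shortened to placeholders; I am assuming --lm and --trie form an optional pair, which is exactly what I want confirmed):

# With a language model: both --lm and --trie supplied
deepspeech --model output_graph.pb --alphabet alphabet.txt \
--lm lm.klm --trie lm.trie --audio test.wav

# Without a language model: both flags omitted
deepspeech --model output_graph.pb --alphabet alphabet.txt --audio test.wav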
A2) Say I train two models on the same data: the first with an LM specified (built with the KenLM library on the vocabulary of the training transcripts, passing the lm_binary and trie parameters), and the second without any LM parameters. Later I use each model for inference. Can I choose to use, or not use, a language model at the inference stage? Can a different language model be used during inference, or should one use the same LM used in training? Are there things to note when choosing an alternative LM, e.g. training with a 3-gram model but using a 4-gram model during inference?
A3) I am facing a problem when I try to use a different LM from the one used during training. My model is trained on only 1k data points. The LM was built using the same 1k transcripts as vocabulary, and a 4-gram lm_binary and trie were specified during training.
Inference works but is understandably very poor. Console output:
(dpsp5v051basic) rohit@DE-W-0246802:~/dpspCODE/v051/DeepSpeech$ deepspeech \
--model /home/rohit/dpspTraining/models/v051/model8-validFirst1k-yesLM-4gram/savedModel/output_graph.pb \
--alphabet /home/rohit/dpspTraining/data/wavFiles/commVoiceSet5-1kTotal/alphabetDir/alphabet-Set5First1050.txt \
--lm /home/rohit/dpspTraining/data/wavFiles/commVoiceSet5-1kTotal/lm/lm4gram/vocabulary-Set5First1050_4gram.klm \
--trie /home/rohit/dpspTraining/data/wavFiles/commVoiceSet5-1kTotal/trie/trie4gram/Set5First1050_4gram.trie \
--audio /home/rohit/dpspTraining/data/wavFiles/wav33/test/File28.wav
Loading model from file /home/rohit/dpspTraining/models/v051/model8-validFirst1k-yesLM-4gram/savedModel/output_graph.pb
TensorFlow: v1.13.1-10-g3e0cc53
DeepSpeech: v0.5.1-0-g4b29b78
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2019-08-01 16:11:02.155443: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-08-01 16:11:02.179690: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "CPU"') for unknown op: UnwrapDatasetVariant
2019-08-01 16:11:02.179740: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: WrapDatasetVariant
2019-08-01 16:11:02.179756: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "CPU"') for unknown op: WrapDatasetVariant
2019-08-01 16:11:02.179891: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: UnwrapDatasetVariant
Loaded model in 0.0283s.
Loading language model from files /home/rohit/dpspTraining/data/wavFiles/commVoiceSet5-1kTotal/lm/lm4gram/vocabulary-Set5First1050_4gram.klm /home/rohit/dpspTraining/data/wavFiles/commVoiceSet5-1kTotal/trie/trie4gram/Set5First1050_4gram.trie
Loaded language model in 0.068s.
Running inference.
a on a in a is the
Inference took 0.449s for 3.041s audio file.
Now I want to use an LM created from a larger vocabulary file of about 600k transcripts, which does include the transcripts of the 1k wav files used for training. This comes from the validated.tsv file of the CommonVoice2 corpus. I have double-checked that the alphabet.txt files for the first-1k vocabulary and the larger 600k vocabulary are identical. I have also created the lm_binary and trie files (allValidated_o4gram.klm, allValidated_o4gram.trie) as 4-gram versions, so the basic specs of this LM match the one used for training.
But when using the larger LM during inference I get an error: "Error: Trie file version mismatch (4 instead of expected 3). Update your trie file." Is it still loading the larger LM? Did DeepSpeech actually pick it up and apply it correctly? How do I fix this error, please? (My untested guess at a fix is sketched after the console output below.)
Console output:
(dpsp5v051basic) rohit@DE-W-0246802:~/dpspCODE/v051/DeepSpeech$ deepspeech \
--model /home/rohit/dpspTraining/models/v051/model8-validFirst1k-yesLM-4gram/savedModel/output_graph.pb \
--alphabet /home/rohit/dpspTraining/data/wavFiles/commVoiceSet5-1kTotal/alphabetDir/alphabet-Set5First1050.txt \
--lm /home/rohit/dpspTraining/data/wavFiles/testVocabAllValidated/lm/lm4gram/vocabulary-allValidated_o4gram.klm \
--trie /home/rohit/dpspTraining/data/wavFiles/testVocabAllValidated/trie/trie4gram/allValidated_o4gram.trie \
--audio /home/rohit/dpspTraining/data/wavFiles/wav33/test/File28.wav
Loading model from file /home/rohit/dpspTraining/models/v051/model8-validFirst1k-yesLM-4gram/savedModel/output_graph.pb
TensorFlow: v1.13.1-10-g3e0cc53
DeepSpeech: v0.5.1-0-g4b29b78
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2019-08-01 16:11:58.305524: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-08-01 16:11:58.322902: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "CPU"') for unknown op: UnwrapDatasetVariant
2019-08-01 16:11:58.322945: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: WrapDatasetVariant
2019-08-01 16:11:58.322956: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "CPU"') for unknown op: WrapDatasetVariant
2019-08-01 16:11:58.323063: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: UnwrapDatasetVariant
Loaded model in 0.0199s.
Loading language model from files /home/rohit/dpspTraining/data/wavFiles/testVocabAllValidated/lm/lm4gram/vocabulary-allValidated_o4gram.klm /home/rohit/dpspTraining/data/wavFiles/testVocabAllValidated/trie/trie4gram/allValidated_o4gram.trie
Error: Trie file version mismatch (4 instead of expected 3). Update your trie file.
Loaded language model in 0.00368s.
Running inference.
an on o tn o as te tee
Inference took 1.893s for 3.041s audio file.
Note that the input audio is the same File28.wav, but the output transcript varies with the LM used:
a on a in a is the (smaller LM, used in both training and inference) vs.
an on o tn o as te tee (different, larger LM used for inference only)
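On the trie version mismatch in A3: my untested guess is that the generate_trie binary I downloaded came from a newer branch (master) than the v0.5.1 decoder expects, so the larger trie was written in the newer format. If that is right, re-downloading the v0.5.1 native client artifacts and rebuilding the trie with the matching generate_trie should fix it, along these lines (the --branch value and target folder name are my assumptions):

# Fetch native client binaries matching v0.5.1, then rebuild the trie with them
python util/taskcluster.py --branch v0.5.1 --target native_client_v051
./native_client_v051/generate_trie \
/home/rohit/dpspTraining/data/wavFiles/commVoiceSet5-1kTotal/alphabetDir/alphabet-Set5First1050.txt \
/home/rohit/dpspTraining/data/wavFiles/testVocabAllValidated/lm/lm4gram/vocabulary-allValidated_o4gram.klm \
/home/rohit/dpspTraining/data/wavFiles/testVocabAllValidated/trie/trie4gram/allValidated_o4gram.trie

Is that the right approach?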
Part B) Background info
B1) Ubuntu 18.04 LTS, no GPU, 32 GB RAM, DeepSpeech v0.5.1 git repo.
- Downloaded Mozilla Common Voice Corpus (English) around mid-June 2019.
- Took the validated.tsv file, did some basic transcript validation, and pruned the dataset to 629,731 entries.
- Selected the first 10k entries, split them 70:20:10 into train:dev:test, and created CSV files.
- Converted the MP3s to wav files (16 kHz, mono, 16-bit), each under 10 seconds (one way to do this is sketched below).
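A minimal sketch of the conversion with sox (hypothetical filenames; sox must be built with MP3 support):

# 16 kHz sample rate, 1 channel (mono), 16-bit samples
sox input.mp3 -r 16000 -c 1 -b 16 output.wav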
- Set up an Anaconda environment for DeepSpeech v0.5.1 and cloned the v0.5.1 code from GitHub.
- Issued the following command in the DeepSpeech folder, which seems to be required to fetch the generate_trie executable and do other required setup:
python util/taskcluster.py --target .
- Installed the CTC decoder from the link obtained with:
python util/taskcluster.py --decoder
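As far as I understand, that command prints the URL of the ds_ctcdecoder wheel, so the install boils down to feeding it straight to pip in the active environment:

pip install $(python util/taskcluster.py --decoder)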
- Next created a vocabulary file containing only the transcripts (sketched after this list).
- No changes to any of the flags or other default parameters.
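For the vocabulary file, a minimal sketch of the extraction, assuming the standard DeepSpeech CSV layout (wav_filename,wav_filesize,transcript) and an alphabet that contains no commas:

# Skip the CSV header, then keep everything from the third field on (the transcript)
tail -n +2 /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/csvFiles/train.csv \
| cut -d',' -f3- > vocabulary-Set3First10k.txt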
B2) Language model related:
- Used KenLM, downloaded from its git repo and compiled. Commands to create the 4-gram version:
- vocabulary file to ARPA:
./lmplz -o 4 --text /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/vocabDir/vocabulary-Set3First10k.txt --arpa /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/vocabDir/vocabulary-Set3First10k_4gram.arpa
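Aside: on a much smaller input (like the 1k vocabulary from question A3), lmplz can abort with a BadDiscountException; the workaround I have seen suggested is adding --discount_fallback (file names below are placeholders):

./lmplz -o 4 --discount_fallback --text vocabulary-small.txt --arpa vocabulary-small_4gram.arpa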
- ARPA to lm_binary file:
./build_binary /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/vocabDir/vocabulary-Set3First10k_4gram.arpa /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/lm/lm4gram/vocabulary-Set3First10k_4gram.klm
- Used generate_trie to make the trie file:
/home/rohit/dpspCODE/v051/DeepSpeech/generate_trie /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/alphabetDir/alphabet-Set3First10k.txt /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/lm/lm4gram/vocabulary-Set3First10k_4gram.klm /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/trie/trie4gram/set3First10k_4gram.trie
- Note: the trie file was generated successfully and was later used to start training.
B3) Commands to start model training (training is still in progress):
B3a) Model without a language model:
python3 -u DeepSpeech.py \
--train_files /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/csvFiles/train.csv \
--dev_files /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/csvFiles/dev.csv \
--test_files /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/csvFiles/test.csv \
--train_batch_size 1 \
--dev_batch_size 1 \
--test_batch_size 1 \
--n_hidden 2048 \
--epoch 20 \
--dropout_rate 0.15 \
--learning_rate 0.0001 \
--export_dir /home/rohit/dpspTraining/models/v051/model5-validFirst10k-noLM/savedModel \
--checkpoint_dir /home/rohit/dpspTraining/models/v051/model5-validFirst10k-noLM/checkpointDir \
--alphabet_config_path /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/alphabetDir/alphabet-Set3First10k.txt \
"$@"
B3b) Model with a language model:
python3 -u DeepSpeech.py \
--train_files /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/csvFiles/train.csv \
--dev_files /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/csvFiles/dev.csv \
--test_files /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/csvFiles/test.csv \
--train_batch_size 1 \
--dev_batch_size 1 \
--test_batch_size 1 \
--n_hidden 2048 \
--epoch 20 \
--dropout_rate 0.15 \
--learning_rate 0.0001 \
--export_dir /home/rohit/dpspTraining/models/v051/model6-validFirst10k-yesLM-4gram/savedModel \
--checkpoint_dir /home/rohit/dpspTraining/models/v051/model6-validFirst10k-yesLM-4gram/checkpointDir \
--decoder_library_path /home/rohit/dpspCODE/v051/DeepSpeech/native_client/libctc_decoder_with_kenlm.so \
--alphabet_config_path /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/alphabetDir/alphabet-Set3First10k.txt \
--lm_binary_path /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/lm/lm4gram/vocabulary-Set3First10k_4gram.klm \
--lm_trie_path /home/rohit/dpspTraining/data/wavFiles/commVoiceSet3-10kTotal/trie/trie4gram/set3First10k_4gram.trie \
"$@"
Thank you for your time! Regards.