Creation of language model and trie

Thank you lissyx for the quick response.

I am working on a prototype that needs an ASR function for Mandarin.
I am trying to train a model with DeepSpeech on this data set.

yuwu has done some awesome work and provides the training materials that DeepSpeech needs for the above data set:
http://blog.yuwu.me/wp-content/uploads/2018/07/thchs30-csv.tar.gz

For now I am reusing these materials (alphabet.txt, vocabulary.txt, words.arpa, lm.binary and the trie) to train the model for a quick test.

Using the latest master branch of DeepSpeech, I was able to train the model until the loss dropped below 50. But when training finishes and the test phase starts, it throws the following error:

Error: Can't parse trie file, invalid header. Try updating your trie file.

I guessed that yuwu's trie might be out of date, so I built generate_trie by following https://github.com/mozilla/DeepSpeech/blob/master/native_client/README.md
I then ran generate_trie on yuwu's alphabet.txt and lm.binary, but the newly generated trie is only 9 bytes, and I don't know what is wrong. Maybe lm.binary is also out of date and needs regenerating as well, but I have not tried that yet.
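
For anyone debugging a suspicious trie like this, a quick look at the file size and the first few bytes can help before retraining. Below is a minimal sketch. It assumes the trie file begins with two 4-byte little-endian integers, a magic number followed by a format version; that layout is only my reading of the "invalid header" and "version mismatch" error messages, not something confirmed in this thread. The function name `peek_trie_header` is hypothetical.

```python
import os
import struct


def peek_trie_header(path):
    """Return (size_in_bytes, magic, version) for a trie file.

    ASSUMPTION: the file starts with two 4-byte little-endian integers,
    a magic number followed by a format version. Check the DeepSpeech
    sources for the release you use before relying on this layout.
    """
    size = os.path.getsize(path)
    if size < 8:
        # Too small to contain even the assumed 8-byte header.
        return size, None, None
    with open(path, "rb") as f:
        magic, version = struct.unpack("<ii", f.read(8))
    return size, magic, version
```

Calling `peek_trie_header("trie")` on a file of only a few bytes would at least confirm that generate_trie wrote little more than a header, which points at the inputs (alphabet.txt, lm.binary) rather than at the training run.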

Could you tell me whether this is the right direction before I try to regenerate lm.binary?

Thanks

You might be interested in the AISHELL Mandarin dataset: http://www.openslr.org/33/

I just landed an importer for it: https://github.com/mozilla/DeepSpeech/blob/master/bin/import_aishell.py

Thanks Reuben for sharing this!

Do you have a script to generate alphabet.txt, vocabulary.txt and words.arpa for the data set you linked?

That dataset is just audio/transcript pairs. You could extract the transcripts I guess and build an LM out of that, but it’s not enough text to build a good LM.
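
Extracting the transcripts could look roughly like the sketch below. It assumes the importer writes CSV files with a `transcript` column (alongside `wav_filename` and `wav_filesize`) and that alphabet.txt holds one character per line; both are assumptions on my part, and the function names are hypothetical. It only gathers the raw text and the character set. It does not make the text sufficient for a good LM, and characters such as `#` may need special handling in alphabet.txt.

```python
import csv


def collect_transcripts(csv_paths):
    """Yield the transcript column of every row in the given CSV files."""
    for path in csv_paths:
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                yield row["transcript"]


def build_alphabet(transcripts):
    """Return the sorted list of distinct characters used in the transcripts."""
    chars = set()
    for text in transcripts:
        chars.update(text)
    return sorted(chars)


def write_lm_inputs(csv_paths, alphabet_path, vocabulary_path):
    """Write one transcript per line and one character per line.

    Returns the number of transcripts written.
    """
    transcripts = list(collect_transcripts(csv_paths))
    with open(vocabulary_path, "w", encoding="utf-8") as f:
        for text in transcripts:
            f.write(text + "\n")
    with open(alphabet_path, "w", encoding="utf-8") as f:
        for ch in build_alphabet(transcripts):
            f.write(ch + "\n")
    return len(transcripts)
```

The resulting vocabulary file is the kind of plain-text corpus that a language model tool can take as input.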

Hi,

I am trying to run DeepSpeech on a small data set.
I am using DeepSpeech version 0.5.1.

Steps I followed:

  1. Cloned the 0.5.1 version from the GitHub repository
  2. pip3 install -r requirements.txt
  3. python util/taskcluster.py --arch gpu --target native_client

Then, to create the language model:
  4. git clone https://github.com/kpu/kenlm.git
  5. cd kenlm/
  6. mkdir build
  7. cd build
  8. cmake ..
  9. make -j 4

  10. vim alphabet.txt (containing all the letters of the English alphabet)
  11. vim some.txt (corpus)
  12. …/kenlm/build/bin/lmplz -o 5 <some.txt >lm.arpa
  13. …/kenlm/build/bin/build_binary lm.arpa lm.binary
  14. …/DeepSpeech/native_client/generate_trie alphabet.txt lm.binary trie
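
The three build commands in the steps above (lmplz, build_binary, generate_trie) can be sketched as a small script, so that the same sequence is repeatable. This is only an illustration: the function names and the output file layout are hypothetical, and the directories holding the KenLM and native client binaries are left as parameters because the paths in the post are truncated.

```python
import os
import subprocess
from contextlib import ExitStack


def lm_pipeline_steps(kenlm_bin, native_client_dir, corpus, alphabet, out_dir, order=5):
    """Describe the three commands from the steps above without running them."""
    arpa = os.path.join(out_dir, "lm.arpa")
    binary = os.path.join(out_dir, "lm.binary")
    trie = os.path.join(out_dir, "trie")
    return [
        # lmplz reads the corpus on stdin and writes the ARPA model to stdout.
        {"argv": [os.path.join(kenlm_bin, "lmplz"), "-o", str(order)],
         "stdin": corpus, "stdout": arpa},
        # build_binary converts the ARPA file into the binary LM.
        {"argv": [os.path.join(kenlm_bin, "build_binary"), arpa, binary],
         "stdin": None, "stdout": None},
        # generate_trie takes the alphabet and binary LM and writes the trie.
        {"argv": [os.path.join(native_client_dir, "generate_trie"),
                  alphabet, binary, trie],
         "stdin": None, "stdout": None},
    ]


def run_steps(steps):
    """Run each step in order and stop at the first failing command."""
    for step in steps:
        with ExitStack() as stack:
            stdin = stack.enter_context(open(step["stdin"], "rb")) if step["stdin"] else None
            stdout = stack.enter_context(open(step["stdout"], "wb")) if step["stdout"] else None
            subprocess.run(step["argv"], stdin=stdin, stdout=stdout, check=True)
```

Keeping the command description separate from execution makes it easy to print the steps first and check that generate_trie comes from the same DeepSpeech release as the training code.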

Then I ran the following command:

python -u DeepSpeech.py \
  --train_files "/home/dev_ds/deepspeech_dir_2/corpus/corpus-train.csv" \
  --dev_files "/home/dev_ds/deepspeech_dir_2/corpus/corpus-dev.csv" \
  --test_files "/home/dev_ds/deepspeech_dir_2/corpus/corpus-test.csv" \
  --alphabet_config_path "/home/dev_ds/deepspeech_dir_2/my-model/alphabet.txt" \
  --lm_binary_path "/home/dev_ds/deepspeech_dir_2/my-model/lm.binary" \
  --lm_trie_path "/home/dev_ds/deepspeech_dir_2/my-model/trie" \
  --learning_rate 0.001 \
  --dropout_rate 0.05 \
  --word_count_weight 3.5 \
  --log_level 1 \
  --display_step 1 \
  --epoch 75 \
  --export_dir "/home/dev_ds/deepspeech_dir_2/my-model"

I am getting the following error:

I Restored variables from most recent checkpoint at /home/dev_ds/.local/share/deepspeech/checkpoints/train-6300, step 6300
I STARTING Optimization
Epoch 0 | Training | Elapsed Time: 0:12:18 | Steps: 31 | Loss: 178.990274
Epoch 0 | Validation | Elapsed Time: 0:00:18 | Steps: 15 | Loss: 167.763600 | Dataset: /home/dev_ds/deepspeech_dir_2/corpus/corpus-dev.csv
I Saved new best validating model with loss 167.763600 to: /home/dev_ds/.local/share/deepspeech/checkpoints/best_dev-6331
Epoch 1 | Training | Elapsed Time: 0:12:19 | Steps: 31 | Loss: 178.690032
Epoch 1 | Validation | Elapsed Time: 0:00:18 | Steps: 15 | Loss: 167.403382 | Dataset: /home/dev_ds/deepspeech_dir_2/corpus/corpus-dev.csv
WARNING:tensorflow:From /home/dev_ds/deepspeech_dir_2/DeepSpeech-0.5.1/venv2/lib/python3.6/site-packages/tensorflow/python/training/saver.py:966: remove_checkpoint (from tensorflow.python.training.checkpoint_management) is deprecated and will be removed in a future version.
Instructions for updating:
Use standard file APIs to delete files with this prefix.
I Saved new best validating model with loss 167.403382 to: /home/dev_ds/.local/share/deepspeech/checkpoints/best_dev-6362
Epoch 2 | Training | Elapsed Time: 0:12:19 | Steps: 31 | Loss: 178.588967
Epoch 2 | Validation | Elapsed Time: 0:00:18 | Steps: 15 | Loss: 167.700894 | Dataset: /home/dev_ds/deepspeech_dir_2/corpus/corpus-dev.csv
Epoch 3 | Training | Elapsed Time: 0:12:10 | Steps: 31 | Loss: 178.937192
Epoch 3 | Validation | Elapsed Time: 0:00:18 | Steps: 15 | Loss: 167.505259 | Dataset: /home/dev_ds/deepspeech_dir_2/corpus/corpus-dev.csv
I Early stop triggered as (for last 4 steps) validation loss: 167.505259 with standard deviation: 0.157128 and mean: 167.622626
I FINISHED optimization in 0:50:25.282585
Error: Trie file version mismatch (4 instead of expected 3). Update your trie file.
terminate called after throwing an instance of 'int'

Kindly help me figure out the error.

When I had that issue, my generate_trie version was wrong.

Hi, please help me.
I did the same as you: I am trying to run DeepSpeech on my own small data set.
But I don't understand the step that creates the trie. I used the generate_trie from native_client.tar.xz, but I can't get it to succeed.
Can you please explain this step in detail?

Welcome to the forum @smalissa17

I suspect you’d do better to re-post this as a distinct topic rather than tack onto the end of an old thread.

Also, I know you think you did the same as the others, but you need to give a lot more specific detail about exactly what you did before others will have even a remote chance of being able to help. Imagine it was someone else posting like you did and you had to help them - you see how opaque your request is?

Anyway, best of luck getting this resolved :slightly_smiling_face:

Thank you.
OK, I will do that.