Fine tuning 0.5.1 - Do I need to create an lm.binary and trie file for training on Common Voice, or can I use the language model already in 0.5.1?

I am fine-tuning 0.5.1 on the Common Voice dataset. Regarding the language model used while training:

  1. Can it be the same as the one provided in 0.5.1,
    OR
  2. do I need to create an LM for Common Voice separately,
    OR
  3. do I need to merge the LibriSpeech LM and the Common Voice sentences to generate a new LM?
    Please suggest.

There is no need to create a new LM; you should be able to re-use the default one.
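Concretely, reusing the default LM means pointing the training flags at the lm.binary and trie shipped in deepspeech-0.5.1-models.tar.gz rather than building new ones. A minimal sketch, with illustrative paths (adjust to wherever you extracted the models archive):

```shell
# Reuse the release LM and trie from the extracted 0.5.1 models tarball.
# All paths below are examples, not the original poster's actual layout.
python3 DeepSpeech.py \
   --train_files corpus/corpus-train.csv \
   --dev_files corpus/corpus-dev.csv \
   --test_files corpus/corpus-test.csv \
   --alphabet_config_path deepspeech-0.5.1-models/alphabet.txt \
   --lm_binary_path deepspeech-0.5.1-models/lm.binary \
   --lm_trie_path deepspeech-0.5.1-models/trie \
   --checkpoint_dir deepspeech-0.5.1-checkpoint/
```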

I am trying to fine-tune the DeepSpeech 0.5.1 model with the downloaded deepspeech-0.5.1-checkpoint.tar.gz.

The code snippet is as below :arrow_down:

source myenv/bin/activate

cd DeepSpeech-0.5.1/

pip3 install -r requirements.txt
pip3 install tensorflow-gpu==1.13.1
pip3 install $(python util/taskcluster.py --decoder)

python util/taskcluster.py --arch gpu --target native_client

# Creating an LM with KenLM

cd ..

git clone https://github.com/kpu/kenlm.git
cd kenlm/ 
mkdir build
cd build/
cmake ..
make -j 4

cd ../../my-model/

../kenlm/build/bin/lmplz  -o 5 <some.txt >lm.arpa

../kenlm/build/bin/build_binary lm.arpa lm.binary

../DeepSpeech-0.5.1/native_client/generate_trie alphabet.txt lm.binary trie

cd ../DeepSpeech-0.5.1/

nohup python3 -u DeepSpeech.py \
   --train_files "/home/dev_ds/deepspeech_dir_1/corpus/corpus-train.csv" \
   --dev_files "/home/dev_ds/deepspeech_dir_1/corpus/corpus-dev.csv" \
   --test_files "/home/dev_ds/deepspeech_dir_1/corpus/corpus-test.csv" \
   --alphabet_config_path "/home/dev_ds/deepspeech_dir/deepspeech-0.5.1-models/alphabet.txt" \
   --lm_binary_path "/home/dev_ds/deepspeech_dir/my-model/lm.binary" \
   --lm_trie_path "/home/dev_ds/deepspeech_dir/my-model/trie" \
   --checkpoint_dir /home/dev_ds/deepspeech_dir/deepspeech-0.5.1-checkpoint/ \
   --train_batch_size 48 \
   --dev_batch_size 4 \
   --test_batch_size 4 \
   --learning_rate 0.00005 \
   --export_dir "/home/dev_ds/deepspeech_dir_1/my-model/" \
&>> new_spkr.log &
---------------------------------------------------------------------------

But log file has following error at the end :point_down:

Error: Trie file version mismatch (4 instead of expected 3). Update your trie file.
terminate called after throwing an instance of 'int'

But when I use the lm.binary and trie files from deepspeech-0.5.1-models.tar.gz, output_graph.pb is exported successfully. However, I am not satisfied with the model's performance using that LM.

This issue has already been discussed in other threads, but I could not find a proper solution. Let me know what is wrong with native_client and/or generate_trie.

myenv specifications:

pip3 freeze

absl-py==0.7.1
asn1crypto==0.24.0
astor==0.8.0
attrdict==2.0.1
audioread==2.1.8
bcrypt==3.1.7
beautifulsoup4==4.8.0
bs4==0.0.1
certifi==2019.6.16
cffi==1.12.3
chardet==3.0.4
cryptography==2.7
cycler==0.10.0
decorator==4.4.0
deepspeech==0.5.1
deepspeech-gpu==0.5.1
ds-ctcdecoder==0.5.1
gast==0.2.2
grpcio==1.23.0
h5py==2.9.0
idna==2.8
joblib==0.13.2
Keras-Applications==1.0.8
Keras-Preprocessing==1.1.0
kiwisolver==1.1.0
librosa==0.7.0
llvmlite==0.29.0
Markdown==3.1.1
matplotlib==3.1.1
mock==3.0.5
numba==0.45.1
numpy==1.15.4
pandas==0.25.1
paramiko==2.6.0
progressbar2==3.43.1
protobuf==3.9.1
pycparser==2.19
PyNaCl==1.3.0
pyparsing==2.4.2
python-dateutil==2.8.0
python-utils==2.3.0
pytz==2019.2
pyxdg==0.26
requests==2.22.0
resampy==0.2.2
scikit-learn==0.21.3
scipy==1.3.1
six==1.12.0
SoundFile==0.10.2
soupsieve==1.9.3
sox==1.3.7
tensorboard==1.13.1
tensorflow-estimator==1.13.0
tensorflow-gpu==1.13.1
termcolor==1.1.0
urllib3==1.25.3
Werkzeug==0.15.5
wget==3.2 

– Thank you

There are issues with the model's performance and accuracy. I tried fine-tuning it over Common Voice with lr 1e-6 for 7 epochs with the checkpoint LM, and the model is still nowhere near accurate. It keeps producing random words for phrases that are super obvious in the audio.

The error already makes it explicit that you have built a mismatched version of the LM. Since you don't document how you proceeded, I cannot tell you what is wrong.

util/taskcluster.py will by default download the latest (master) version of the artifact, even though you're running the code from v0.5.1. This means you're using a newer, incompatible version of generate_trie. You have to specify the version with --branch v0.5.1.
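A minimal sketch of the corrected download step, pinned to the release branch so that generate_trie matches the 0.5.1 decoder:

```shell
# Download native_client artifacts built from the v0.5.1 branch rather
# than master, so generate_trie emits trie files the 0.5.1 ds-ctcdecoder
# can read (format version 3 instead of 4).
python util/taskcluster.py --arch gpu --branch v0.5.1 --target native_client
```

After re-downloading, regenerate the trie with the pinned generate_trie binary before restarting training.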

Thank you Reuben Morais, it worked finally :+1: