Hi,
I am trying to train the DeepSpeech model for Brazilian Portuguese.
There are few datasets available for Brazilian Portuguese (here is a work that used 14 hours of speech).
I managed to obtain a 109-hour Brazilian Portuguese dataset and I am trying to train DeepSpeech on it. The dataset consists of spontaneous speech collected from sociolinguistic interviews and was completely transcribed by hand.
To create the LM and trie I followed the recommendations in the documentation.
I created words.arpa with the following command (RawText.txt contains all the transcripts, with the wav file paths removed):
./lmplz --text ../../datasets/ASR-Portuguese-Corpus-V1/RawText.txt --arpa /tmp/words.arpa --order 5 --temp_prefix /tmp/
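For reference, a minimal sketch of how RawText.txt can be extracted from the training CSV (assuming the standard DeepSpeech columns wav_filename, wav_filesize, transcript):

```
# Sketch: dump only the transcript column of a DeepSpeech CSV into a
# plain-text corpus for KenLM (assumes the standard
# wav_filename,wav_filesize,transcript layout).
import csv

with open("metadata_train.csv", newline="", encoding="utf-8") as src, \
     open("RawText.txt", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        dst.write(row["transcript"].strip() + "\n")
```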
I generated lm.binary:
kenlm/build/bin/build_binary -a 255 -q 8 trie lm.arpa lm.binary
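As a basic sanity check that lm.binary at least loads and prefers in-domain text, it can be queried with the kenlm Python bindings (a minimal sketch, assuming the kenlm pip package is installed):

```
# A well-built LM should give a much higher log10 score to an in-domain
# Portuguese sentence than to a nonsense string.
import kenlm

model = kenlm.Model("lm.binary")
print(model.score("não me animo muito", bos=True, eos=True))  # in-domain
print(model.score("xqz wvk jjj qqq", bos=True, eos=True))     # nonsense
```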
I installed the native client:
python util/taskcluster.py --arch gpu --target native_client --branch v0.6.0
I created the file alphabet.txt with the following:
```
# Each line in this file represents the Unicode codepoint (UTF-8 encoded)
# associated with a numeric label.
# A line that starts with # is a comment. You can escape it with \# if you wish
# to use '#' as a label.
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
ç
ã
à
á
â
ê
é
í
ó
ô
õ
ú
û
```
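To rule out a label-coverage problem, every character that occurs in the transcripts can be checked against alphabet.txt (a minimal sketch, assuming the standard DeepSpeech CSV layout; note that the space character needs its own line in the alphabet, which is easy to lose when pasting the file):

```
# List transcript characters that have no label in alphabet.txt;
# any character reported here would break the CTC label encoding.
import csv

with open("alphabet.txt", encoding="utf-8") as f:
    labels = {line.rstrip("\n") for line in f if not line.startswith("#")}

seen = set()
with open("metadata_train.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        seen.update(row["transcript"])

print("characters missing from the alphabet:", sorted(seen - labels))
```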
Then I generated the trie:
DeepSpeech/native_client/generate_trie ../datasets/ASR-Portuguese-Corpus-V1/alphabet.txt lm.binary trie
Then I trained the model with the following flags:
--train_files ../../datasets/ASR-Portuguese-Corpus-V1/metadata_train.csv \
--checkpoint_dir ../deepspeech_v6-0-0/checkpoints/ \
--test_files ../../datasets/ASR-Portuguese-Corpus-V1/metadata_test_200.csv \
--alphabet_config_path ../../datasets/ASR-Portuguese-Corpus-V1/alphabet.txt \
--lm_binary_path ../../datasets/deepspeech-data/lm.binary \
--lm_trie_path ../../datasets/deepspeech-data/trie \
--train_batch_size 2 \
--test_batch_size 2 \
--dev_batch_size 2 \
--export_batch_size 2 \
--epochs 200 \
--early_stop False \
Previously I trained the model with early stopping enabled (with dev_files specified), but training stopped after 4 epochs, so I disabled early stopping. Both the 50-epoch and the 4-epoch models give the same results.
I ran the test using the following command:
python evaluate.py \
--checkpoint_dir ../deepspeech_v6-0-0/checkpoints/ \
--test_files ../../datasets/ASR-Portuguese-Corpus-V1/metadata_test_200.csv \
--alphabet_config_path ../../datasets/ASR-Portuguese-Corpus-V1/alphabet.txt \
--lm_binary_path ../../datasets/deepspeech-data/lm.binary \
--lm_trie_path ../../datasets/deepspeech-data/trie
The result was:
INFO:tensorflow:Restoring parameters from ../deepspeech_v6-0-0/checkpoints/train-2796891
I0102 09:06:41.871738 139898013472576 saver.py:1280] Restoring parameters from ../deepspeech_v6-0-0/checkpoints/train-2796891
I Restored variables from most recent checkpoint at ../deepspeech_v6-0-0/checkpoints/train-2796891, step 2796891
Testing model on ../../datasets/ASR-Portuguese-Corpus-V1/metadata_test_200.csv
Test epoch | Steps: 0 | Elapsed Time: 0:00:00 2020-01-02 09:06:42.344339: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
2020-01-02 09:06:42.384953: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2020-01-02 09:06:42.537285: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
Test epoch | Steps: 199 | Elapsed Time: 0:01:28
Test on ../../datasets/ASR-Portuguese-Corpus-V1/metadata_test_200.csv - WER: 0.956973, CER: 0.852231, loss: 101.685509
--------------------------------------------------------------------------------
WER: 4.000000, CER: 2.333333, loss: 54.671597
- wav: file:///media/edresson/5bef138d-5bcc-41af-a3f0-67c9bd0032c4/edresson/DD/datasets/ASR-Portuguese-Corpus-V1/data/53999_nurc_.wav
- src: "lá "
- res: "e a e a "
--------------------------------------------------------------------------------
WER: 2.000000, CER: 0.666667, loss: 32.827530
- wav: file:///media/edresson/5bef138d-5bcc-41af-a3f0-67c9bd0032c4/edresson/DD/datasets/ASR-Portuguese-Corpus-V1/data/17216_nurc_.wav
- src: "revistas "
- res: "e a "
--------------------------------------------------------------------------------
WER: 1.200000, CER: 0.739130, loss: 79.709518
- wav: file:///media/edresson/5bef138d-5bcc-41af-a3f0-67c9bd0032c4/edresson/DD/datasets/ASR-Portuguese-Corpus-V1/data/60600_nurc_.wav
- src: "num não me animo muito "
- res: "e a a a a a "
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.500000, loss: 8.319281
- wav: file:///media/edresson/5bef138d-5bcc-41af-a3f0-67c9bd0032c4/edresson/DD/datasets/ASR-Portuguese-Corpus-V1/data/33267_sp_.wav
- src: "é "
- res: "e "
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 11.219957
- wav: file:///media/edresson/5bef138d-5bcc-41af-a3f0-67c9bd0032c4/edresson/DD/datasets/ASR-Portuguese-Corpus-V1/data/37622_sp_.wav
- src: "né "
- res: "e"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.500000, loss: 11.632010
- wav: file:///media/edresson/5bef138d-5bcc-41af-a3f0-67c9bd0032c4/edresson/DD/datasets/ASR-Portuguese-Corpus-V1/data/29378_nurc_.wav
- src: "é "
- res: "e "
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.500000, loss: 12.242241
- wav: file:///media/edresson/5bef138d-5bcc-41af-a3f0-67c9bd0032c4/edresson/DD/datasets/ASR-Portuguese-Corpus-V1/data/37172_nurc_.wav
- src: "é "
- res: "e "
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 13.220651
- wav: file:///media/edresson/5bef138d-5bcc-41af-a3f0-67c9bd0032c4/edresson/DD/datasets/ASR-Portuguese-Corpus-V1/data/62827_sp_.wav
- src: "não "
- res: "e"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.750000, loss: 14.941595
- wav: file:///media/edresson/5bef138d-5bcc-41af-a3f0-67c9bd0032c4/edresson/DD/datasets/ASR-Portuguese-Corpus-V1/data/844_nurc_.wav
- src: "mas "
- res: "e "
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.750000, loss: 14.989404
- wav: file:///media/edresson/5bef138d-5bcc-41af-a3f0-67c9bd0032c4/edresson/DD/datasets/ASR-Portuguese-Corpus-V1/data/22739_sp_.wav
- src: "uhn "
- res: "e "
--------------------------------------------------------------------------------
The model very often transcribes just the letter “e”; this letter is very frequent in the dataset.
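A quick character-frequency count over the training transcripts can confirm this (again a minimal sketch, assuming the standard CSV layout):

```
# Print the ten most frequent characters in the training transcripts.
import csv
from collections import Counter

counts = Counter()
with open("metadata_train.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        counts.update(row["transcript"])

for char, n in counts.most_common(10):
    print(repr(char), n)
```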
Am I doing something wrong?
How can I check whether my lm.binary and trie are correct?
Does anyone have any suggestions?
Best Regards,