Hi,
I am trying to train the DeepSpeech model for Brazilian Portuguese.
There are few datasets available for Brazilian Portuguese (here is a work that used 14 hours of speech).
I managed to obtain a 109-hour Brazilian Portuguese dataset and I am trying to train DeepSpeech on it. The dataset consists of spontaneous speech collected from sociolinguistic interviews and was completely transcribed by hand.
To create the LM and trie I followed the recommendations in the documentation.
I created words.arpa with the following command (RawText.txt contains all the transcripts, with the wav file paths removed):
./lmplz --text ../../datasets/ASR-Portuguese-Corpus-V1/RawText.txt --arpa /tmp/words.arpa --order 5 --temp_prefix /tmp/
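For reference, a minimal sketch of how RawText.txt can be extracted from the training CSV (assuming the standard DeepSpeech columns wav_filename, wav_filesize, transcript):

```
# Sketch: dump only the transcript column of a DeepSpeech CSV into a
# plain-text corpus for KenLM (assumes the standard
# wav_filename,wav_filesize,transcript layout).
import csv

with open("metadata_train.csv", newline="", encoding="utf-8") as src, \
     open("RawText.txt", "w", encoding="utf-8") as dst:
    for row in csv.DictReader(src):
        dst.write(row["transcript"].strip() + "\n")
```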
I generated lm.binary:
kenlm/build/bin/build_binary -a 255 -q 8 trie lm.arpa lm.binary
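As a basic sanity check that lm.binary at least loads and prefers in-domain text, it can be queried with the kenlm Python bindings (a minimal sketch, assuming the kenlm pip package is installed):

```
# A well-built LM should give a much higher log10 score to an in-domain
# Portuguese sentence than to a nonsense string.
import kenlm

model = kenlm.Model("lm.binary")
print(model.score("não me animo muito", bos=True, eos=True))  # in-domain
print(model.score("xqz wvk jjj qqq", bos=True, eos=True))     # nonsense
```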
I installed the native client:
python util/taskcluster.py --arch gpu --target native_client --branch v0.6.0
I created the file alphabet.txt with the following:
```
# Each line in this file represents the Unicode codepoint (UTF-8 encoded)
# associated with a numeric label.
# A line that starts with # is a comment. You can escape it with \# if you wish
# to use '#' as a label.
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
ç
ã
à
á
â
ê
é
í
ó
ô
õ
ú
û
```
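To rule out a label-coverage problem, every character that occurs in the transcripts can be checked against alphabet.txt (a minimal sketch, assuming the standard DeepSpeech CSV layout; note that the space character needs its own line in the alphabet, which is easy to lose when pasting the file):

```
# List transcript characters that have no label in alphabet.txt;
# any character reported here would break the CTC label encoding.
import csv

with open("alphabet.txt", encoding="utf-8") as f:
    labels = {line.rstrip("\n") for line in f if not line.startswith("#")}

seen = set()
with open("metadata_train.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        seen.update(row["transcript"])

print("characters missing from the alphabet:", sorted(seen - labels))
```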
Then I generated the trie:
DeepSpeech/native_client/generate_trie ../datasets/ASR-Portuguese-Corpus-V1/alphabet.txt lm.binary trie
Then I trained the model with the following flags:
--train_files ../../datasets/ASR-Portuguese-Corpus-V1/metadata_train.csv \
--checkpoint_dir ../deepspeech_v6-0-0/checkpoints/ \
--test_files ../../datasets/ASR-Portuguese-Corpus-V1/metadata_test_200.csv \
--alphabet_config_path ../../datasets/ASR-Portuguese-Corpus-V1/alphabet.txt \
--lm_binary_path ../../datasets/deepspeech-data/lm.binary \
--lm_trie_path ../../datasets/deepspeech-data/trie \
--train_batch_size 2 \
--test_batch_size 2 \
--dev_batch_size 2 \
--export_batch_size 2 \
--epochs 200 \
--early_stop False \
Previously I trained the model with early stopping enabled (with dev_files specified), but training stopped after 4 epochs, so I disabled early stopping. Both the 50-epoch and the 4-epoch models give the same results.
I ran the test using the following command:
python evaluate.py \
--checkpoint_dir ../deepspeech_v6-0-0/checkpoints/ \
--test_files ../../datasets/ASR-Portuguese-Corpus-V1/metadata_test_200.csv \
--alphabet_config_path ../../datasets/ASR-Portuguese-Corpus-V1/alphabet.txt \
--lm_binary_path ../../datasets/deepspeech-data/lm.binary \
--lm_trie_path ../../datasets/deepspeech-data/trie
The result was:
INFO:tensorflow:Restoring parameters from ../deepspeech_v6-0-0/checkpoints/train-2796891
I0102 09:06:41.871738 139898013472576 saver.py:1280] Restoring parameters from ../deepspeech_v6-0-0/checkpoints/train-2796891
I Restored variables from most recent checkpoint at ../deepspeech_v6-0-0/checkpoints/train-2796891, step 2796891
Testing model on ../../datasets/ASR-Portuguese-Corpus-V1/metadata_test_200.csv
Test epoch | Steps: 0 | Elapsed Time: 0:00:00 2020-01-02 09:06:42.344339: W tensorflow/compiler/jit/mark_for_compilation_pass.cc:1412] (One-time warning): Not using XLA:CPU for cluster because envvar TF_XLA_FLAGS=--tf_xla_cpu_global_jit was not set. If you want XLA:CPU, either set that envvar, or use experimental_jit_scope to enable XLA:CPU. To confirm that XLA is active, pass --vmodule=xla_compilation_cache=1 (as a proper command-line flag, not via TF_XLA_FLAGS) or set the envvar XLA_FLAGS=--xla_hlo_profile.
2020-01-02 09:06:42.384953: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcublas.so.10.0
2020-01-02 09:06:42.537285: I tensorflow/stream_executor/platform/default/dso_loader.cc:42] Successfully opened dynamic library libcudnn.so.7
Test epoch | Steps: 199 | Elapsed Time: 0:01:28
Test on ../../datasets/ASR-Portuguese-Corpus-V1/metadata_test_200.csv - WER: 0.956973, CER: 0.852231, loss: 101.685509
--------------------------------------------------------------------------------
WER: 4.000000, CER: 2.333333, loss: 54.671597
- wav: file:///media/edresson/5bef138d-5bcc-41af-a3f0-67c9bd0032c4/edresson/DD/datasets/ASR-Portuguese-Corpus-V1/data/53999_nurc_.wav
- src: "lá "
- res: "e a e a "
--------------------------------------------------------------------------------
WER: 2.000000, CER: 0.666667, loss: 32.827530
- wav: file:///media/edresson/5bef138d-5bcc-41af-a3f0-67c9bd0032c4/edresson/DD/datasets/ASR-Portuguese-Corpus-V1/data/17216_nurc_.wav
- src: "revistas "
- res: "e a "
--------------------------------------------------------------------------------
WER: 1.200000, CER: 0.739130, loss: 79.709518
- wav: file:///media/edresson/5bef138d-5bcc-41af-a3f0-67c9bd0032c4/edresson/DD/datasets/ASR-Portuguese-Corpus-V1/data/60600_nurc_.wav
- src: "num não me animo muito "
- res: "e a a a a a "
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.500000, loss: 8.319281
- wav: file:///media/edresson/5bef138d-5bcc-41af-a3f0-67c9bd0032c4/edresson/DD/datasets/ASR-Portuguese-Corpus-V1/data/33267_sp_.wav
- src: "é "
- res: "e "
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 11.219957
- wav: file:///media/edresson/5bef138d-5bcc-41af-a3f0-67c9bd0032c4/edresson/DD/datasets/ASR-Portuguese-Corpus-V1/data/37622_sp_.wav
- src: "né "
- res: "e"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.500000, loss: 11.632010
- wav: file:///media/edresson/5bef138d-5bcc-41af-a3f0-67c9bd0032c4/edresson/DD/datasets/ASR-Portuguese-Corpus-V1/data/29378_nurc_.wav
- src: "é "
- res: "e "
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.500000, loss: 12.242241
- wav: file:///media/edresson/5bef138d-5bcc-41af-a3f0-67c9bd0032c4/edresson/DD/datasets/ASR-Portuguese-Corpus-V1/data/37172_nurc_.wav
- src: "é "
- res: "e "
--------------------------------------------------------------------------------
WER: 1.000000, CER: 1.000000, loss: 13.220651
- wav: file:///media/edresson/5bef138d-5bcc-41af-a3f0-67c9bd0032c4/edresson/DD/datasets/ASR-Portuguese-Corpus-V1/data/62827_sp_.wav
- src: "não "
- res: "e"
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.750000, loss: 14.941595
- wav: file:///media/edresson/5bef138d-5bcc-41af-a3f0-67c9bd0032c4/edresson/DD/datasets/ASR-Portuguese-Corpus-V1/data/844_nurc_.wav
- src: "mas "
- res: "e "
--------------------------------------------------------------------------------
WER: 1.000000, CER: 0.750000, loss: 14.989404
- wav: file:///media/edresson/5bef138d-5bcc-41af-a3f0-67c9bd0032c4/edresson/DD/datasets/ASR-Portuguese-Corpus-V1/data/22739_sp_.wav
- src: "uhn "
- res: "e "
--------------------------------------------------------------------------------
The model very often transcribes just the letter “e”; this letter is very frequent in the dataset.
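A quick character-frequency count over the training transcripts can confirm this (again a minimal sketch, assuming the standard CSV layout):

```
# Print the ten most frequent characters in the training transcripts.
import csv
from collections import Counter

counts = Counter()
with open("metadata_train.csv", newline="", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        counts.update(row["transcript"])

for char, n in counts.most_common(10):
    print(repr(char), n)
```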
Am I doing something wrong?
How can I check whether my lm.binary and trie are correct?
Does anyone have any suggestions?
Best Regards,