Overfitting a DeepSpeech model on a small amount of data

So here’s my problem: I’m trying to create a personal healthcare assistant for healthcare providers. The assistant only needs to recognize a specific set of commands (23 in total). Since these sentences are few in number, I want to overfit the DeepSpeech model on them, so that I can get high accuracy on a small amount of data.

Here’s how I go about it: I have 6 samples of each command, each from a different speaker. My validation and test data are the training data itself (overfitting, right? :stuck_out_tongue: ). However, after continuing training for 3 epochs from a pretrained model, it ends up just predicting the letter h.
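For context, my training CSV uses the standard DeepSpeech format of wav_filename, wav_filesize, transcript per row. A minimal sketch (the file names and sizes here are illustrative, not my actual data):

wav_filename,wav_filesize,transcript
/home/furqan/Projects/assistant/audio_data/downsampled/next_patient_01.wav,131244,next patient
/home/furqan/Projects/assistant/audio_data/downsampled/whos_next_04.wav,98604,whos next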

The following is the log after training the model:

Computing acoustic model predictions...
100% (46 of 46) |######################################################################################################| Elapsed Time: 0:01:39 Time:  0:01:39
Decoding predictions...
100% (46 of 46) |######################################################################################################| Elapsed Time: 0:01:16 Time:  0:01:16
Test - WER: 10.146552, CER: 3.705833, loss: 196.455765
--------------------------------------------------------------------------------
WER: 24.500000, CER: 96.000000, loss: 166.814606
 - src: "next patient"
 - res: "h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h"
--------------------------------------------------------------------------------
WER: 22.000000, CER: 86.000000, loss: 123.011139
 - src: "whos next"
 - res: "h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h "
--------------------------------------------------------------------------------
WER: 21.500000, CER: 85.000000, loss: 130.863281
 - src: "next patient"
 - res: "h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h "
--------------------------------------------------------------------------------
WER: 20.500000, CER: 79.000000, loss: 116.693535
 - src: "whos next"
 - res: "h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h"
--------------------------------------------------------------------------------
WER: 20.333333, CER: 120.000000, loss: 190.458618
 - src: "my first patient"
 - res: "h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h "
--------------------------------------------------------------------------------
WER: 17.000000, CER: 98.000000, loss: 200.597824
 - src: "how many appointments"
 - res: "h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h"
--------------------------------------------------------------------------------
WER: 15.500000, CER: 119.000000, loss: 207.080734
 - src: "whos my first patient"
 - res: "h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h"
--------------------------------------------------------------------------------
WER: 14.666667, CER: 85.000000, loss: 176.761948
 - src: "my first patient"
 - res: "h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h"
--------------------------------------------------------------------------------
WER: 14.000000, CER: 81.000000, loss: 129.585342
 - src: "who is next"
 - res: "h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h "
--------------------------------------------------------------------------------
WER: 14.000000, CER: 81.000000, loss: 155.495468
 - src: "who is next"
 - res: "h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h h "

Can anyone help identify the problem, whether in the training steps or in the data quantity/quality? I’d also welcome any different approaches anyone can suggest for this specific requirement. Thanks.

Could you please start by formatting your console output properly as code? It’s hard to distinguish what is output and what is your comment/question.

I’ve updated the console output in the post to the appropriate format. Please check now.


I guess it’d be very important that you also share the command line you use.

Here’s the command I’m running. I downloaded the compressed checkpoint file specified and extracted its contents into the “fine_tuning_checkpoints” directory.

python3 DeepSpeech.py --n_hidden 2048 --fine_tuning_checkpoints/ --epoch -1 --train_files /home/furqan/Projects/assistant/audio_data/downsampled/training.csv --dev_files /home/furqan/Projects/assistant/audio_data/downsampled/training.csv --test_files /home/furqan/Projects/assistant/audio_data/downsampled/training.csv --learning_rate 0.0001

I’m running the code on an Ubuntu machine with 125 GB of RAM:

/furqan/Projects$ uname -a
Linux DS0211 4.15.0-45-generic #48-Ubuntu SMP Tue Jan 29 16:28:13 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

Here you state that you continued training for three epochs, but the command line you shared is for one extra epoch; that’s inconsistent.

Having experimented with that, I got very good results by just building a language model made of those commands. Maybe you should give that a try?

Why would you absolutely require re-training? Do those commands use domain-specific language that is unlikely to be properly recognized by default?

Looks like you forgot to specify the --checkpoint_dir flag, so it’s starting a new training from scratch rather than fine-tuning the release model.

Thanks @reuben. Turns out I was specifying the checkpoint directory the wrong way: I was not passing the --checkpoint_dir argument along with the directory itself.
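For anyone who hits the same issue, the corrected command should look something like this (same flags and paths as before, with the extracted checkpoint directory now passed via --checkpoint_dir):

python3 DeepSpeech.py --n_hidden 2048 --checkpoint_dir fine_tuning_checkpoints/ --epoch -1 --train_files /home/furqan/Projects/assistant/audio_data/downsampled/training.csv --dev_files /home/furqan/Projects/assistant/audio_data/downsampled/training.csv --test_files /home/furqan/Projects/assistant/audio_data/downsampled/training.csv --learning_rate 0.0001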

@lissyx training from scratch on the data I have, or changing the language model, is not desirable, as the language is still English and the predictions are also part of the English language (if that makes sense).

I can report that the issue has been resolved: after training the model for 5 epochs, it’s now giving near-perfect results (there are still some kinks that need to be ironed out).

I’m not sure what you’re saying here, but I can assure you that just making a specific language model with the English acoustic model works very well. So I don’t get your point.

As an alternative suggestion, I wouldn’t take the route you’re taking, i.e. fine-tuning the DeepSpeech acoustic model.

What I would do is simply use the existing acoustic model and create a new language model using only the 23 sentences you expect to hear.

@lissyx has done just this for a demo we made, and the results are very good. This direction has the advantage of being much simpler to execute.
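For reference, a minimal sketch of building such a language model with KenLM (assuming KenLM is built and commands.txt contains one of the 23 command sentences per line):

# Train a small ARPA model on the 23 commands; --discount_fallback is
# needed because the corpus is tiny.
lmplz --order 3 --discount_fallback --text commands.txt --arpa commands.arpa
# Convert the ARPA file to the binary format DeepSpeech loads.
build_binary commands.arpa lm.binary

You would then point the decoder at the new lm.binary and regenerate the trie with DeepSpeech’s generate_trie tool; the exact flags and arguments depend on your DeepSpeech version.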