I recorded my own voice to test out DeepSpeech, and I want the model to recognise this recording when I play it back to the model.
These are the steps I followed:
1> I prepared train.csv, test.csv and dev.csv, all containing the following single entry:
/Users/kausthub.naarayan/speech_project/long_route.wav|1|the driver took a long route|
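For reference, this is roughly how I write that entry into the three CSV files (a throwaway script of my own, assuming Python; it just reproduces the pipe-delimited line above, and the file names match the ones I pass to DeepSpeech.py in step 2):

# My own throwaway helper, not part of DeepSpeech: writes the single
# pipe-delimited entry shown above into the train/dev/test CSV files.
import os

wav_path = "/Users/kausthub.naarayan/speech_project/long_route.wav"
transcript = "the driver took a long route"
entry = "{}|1|{}|\n".format(wav_path, transcript)

for name in ("train-2.csv", "dev-2.csv", "test-2.csv"):
    with open(os.path.join("../new_model-2", name), "w") as f:
        f.write(entry)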
2> I then ran this command to train the model on the recording:
python -u DeepSpeech.py \
  --train_files ../new_model-2/train-2.csv \
  --dev_files ../new_model-2/dev-2.csv \
  --test_files ../new_model-2/test-2.csv \
  --train_batch_size 80 \
  --dev_batch_size 80 \
  --test_batch_size 40 \
  --n_hidden 375 \
  --epoch 33 \
  --validation_step 1 \
  --early_stop True \
  --earlystop_nsteps 6 \
  --estop_mean_thresh 0.1 \
  --estop_std_thresh 0.1 \
  --dropout_rate 0.22 \
  --learning_rate 0.00095 \
  --report_count 100 \
  --use_seq_length False \
  --export_dir ../new_model-2/ \
  --decoder_library_path ../libctc_decoder_with_kenlm.so \
  --alphabet_config_path ../models/alphabet.txt \
  --lm_binary_path ../models/lm.binary \
  --lm_trie_path ../models/trie \
  "$@"
I am using the lm.binary, alphabet.txt and trie files from the pre-trained model that ships with DeepSpeech.
This outputs a new model.
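Before running inference I check that the export actually produced a graph in the export directory (a quick sketch of my own; output_graph.pb is simply the file name I use in step 3 below):

# My own quick check: confirm the exported graph is where step 3 expects it.
import os

exported_graph = "../new_model-2/output_graph.pb"
if os.path.isfile(exported_graph):
    print("found {} ({:.1f} MB)".format(exported_graph, os.path.getsize(exported_graph) / 1e6))
else:
    print("no exported graph at", exported_graph)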
3> I then run the audio file against this new model and the existing language model with the following command:
../DeepSpeech/deepspeech output_graph.pb ../models/alphabet.txt ../models/lm.binary ../models/trie ../long_route.wav
This gives an output that is not even close to the transcript.
This is the result I got:
the rotototogroe
The actual transcript is:
the driver took a long route
Can anyone please help me find out what mistake I am making?
These are the details of the audio file:
Input File : 'long_route.wav'
Channels : 1
Sample Rate : 16000
Precision : 16-bit
Duration : 00:00:06.41 = 102516 samples ~ 480.544 CDDA sectors
File Size : 205k
Bit Rate : 256k
Sample Encoding: 16-bit Signed Integer PCM
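To rule out a format problem, I also verify the recording from Python (a small sketch using only the standard wave module; 16 kHz, mono, 16-bit signed PCM is what the pre-trained DeepSpeech model expects):

# Sanity check with the standard library: confirm 16 kHz, mono, 16-bit PCM.
import wave

w = wave.open("/Users/kausthub.naarayan/speech_project/long_route.wav", "rb")
print("channels       :", w.getnchannels())   # expect 1
print("sample rate    :", w.getframerate())   # expect 16000
print("sample width   :", w.getsampwidth())   # expect 2 (16-bit)
print("duration (sec) :", w.getnframes() / float(w.getframerate()))
w.close()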
link to my audio file: https://vocaroo.com/i/s0qUMwH3qqUF
Thanks in advance.