Hi,
A few months ago, I trained a DeepSpeech model on a Hindi-English code-mixed dataset (mostly Hindi, roughly 80-90%) of 1600 hrs of mono audio. I got WER: 0.20, train loss: 35.36, validation loss: 48.23, and good transcription results on the test data.
Then I added 300 hrs of new audio (mono, converted from stereo using sox) from a similar environment. The speech sounds faster because the source calls are long, but I cut them into audio chunks of 1.5 to 10 sec (roughly as in the sketch below). I then retrained DeepSpeech from scratch on the combined data (1600 + 300 = 1900 hrs).
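For context, the stereo-to-mono conversion and chunking were along these lines (a minimal sketch with example file names, start time, and duration, not my exact script):

```python
# Sketch of the preprocessing: downmix stereo to mono with sox, then cut a
# fixed-length chunk. File names, start time, and duration are just examples.
import subprocess

def stereo_to_mono(src_wav, dst_wav):
    # "-c 1" tells sox to mix the two channels down to a single channel.
    subprocess.run(["sox", src_wav, "-c", "1", dst_wav], check=True)

def cut_chunk(src_wav, dst_wav, start_sec, dur_sec):
    # "trim <start> <duration>" extracts one chunk; I keep chunks of 1.5-10 sec.
    subprocess.run(["sox", src_wav, dst_wav, "trim", str(start_sec), str(dur_sec)],
                   check=True)

stereo_to_mono("call_stereo.wav", "call_mono.wav")
cut_chunk("call_mono.wav", "chunk_0001.wav", 0.0, 6.0)  # a 6 sec chunk
```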
Now I found that:
- The model early-stopped at the 10th epoch, while previously (on the 1600 hrs data) it early-stopped at the 13th epoch.
- I got WER: 0.31, train loss: 51.10, and validation loss: 64.20.
I tested this model on two kinds of audio: first, audio from the original 1600 hrs of data, and second, audio from the new 300 hrs. The model gives the same transcriptions as before on the first kind, but on the second kind (audio from the new 300 hrs) it skips a lot of words and the transcription quality is very poor.
The prediction improves when I feed an audio chunk of roughly 1-4 sec that has been amplitude peak-normalized in Audacity.
So I then peak-normalized my whole training set (1900 hrs) and trained from scratch again (roughly as sketched below). This time the model again early-stopped at the 10th epoch; the train loss went from 51.10 to 50.09, the validation loss from 64.20 to 63 (minor changes), and the WER became 0.2982, but the transcriptions did not improve.
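For reference, the normalization over the whole set was done along these lines (a sox equivalent of Audacity's Normalize effect; the -3 dB headroom is just an example value, not necessarily what matters here):

```python
# Sketch of batch peak normalization with sox: "gain -n <dB>" scales each file
# so its peak sits at the given level below 0 dBFS. The -3 dB value is an example.
import subprocess

def peak_normalize(src_wav, dst_wav, headroom_db=-3.0):
    subprocess.run(["sox", src_wav, dst_wav, "gain", "-n", str(headroom_db)],
                   check=True)

peak_normalize("chunk_0001.wav", "chunk_0001_norm.wav")
```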
My questions are:
- What was the reason behind the model’s behaviour?
- Why is the model's transcription better on normalized audio chunks? Do I need to make all training audio the same length?
- The model's accuracy degraded after I trained with the new data (300 hrs) converted from stereo to mono. Does stereo-to-mono conversion affect accuracy?
Any help is appreciated.
Thanks