Hi,
A few months ago, I trained a DeepSpeech model on a Hindi-English code-mixed dataset (mostly Hindi, roughly 80-90%) of 1600 hrs of mono audio. I got WER: 0.20, train loss: 35.36, validation loss: 48.23, and good transcription results on the test data.
Then I added 300 hrs of new audio (mono, converted from stereo using sox) from a similar environment. The speech sounds faster because the source calls are long, but I cut them into audio chunks of 1.5 to 10 sec (roughly as in the sketch below). I then retrained DeepSpeech from scratch on the combined data (1600 + 300 = 1900 hrs).
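For context, the stereo-to-mono conversion and chunking were along these lines (a minimal sketch with example file names, start time, and duration, not my exact script):

```python
# Sketch of the preprocessing: downmix stereo to mono with sox, then cut a
# fixed-length chunk. File names, start time, and duration are just examples.
import subprocess

def stereo_to_mono(src_wav, dst_wav):
    # "-c 1" tells sox to mix the two channels down to a single channel.
    subprocess.run(["sox", src_wav, "-c", "1", dst_wav], check=True)

def cut_chunk(src_wav, dst_wav, start_sec, dur_sec):
    # "trim <start> <duration>" extracts one chunk; I keep chunks of 1.5-10 sec.
    subprocess.run(["sox", src_wav, dst_wav, "trim", str(start_sec), str(dur_sec)],
                   check=True)

stereo_to_mono("call_stereo.wav", "call_mono.wav")
cut_chunk("call_mono.wav", "chunk_0001.wav", 0.0, 6.0)  # a 6 sec chunk
```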
Now I found that:
- The model early-stopped at the 10th epoch, while previously (on the 1600 hrs data) it early-stopped at the 13th epoch.
- I got WER: 0.31, train loss: 51.10, and validation loss: 64.20.
I tested this model on two kinds of audio: first, audio from the original 1600 hrs of data, and second, audio from the new 300 hrs. The model gives the same transcriptions as before on the first kind, but on the second kind (audio from the new 300 hrs) it skips a lot of words and the transcription quality is very poor.
The prediction improves when I feed an audio chunk of roughly 1-4 sec that has been amplitude peak-normalized in Audacity.
So I then peak-normalized my whole training set (1900 hrs) and trained from scratch again (roughly as sketched below). This time the model again early-stopped at the 10th epoch; the train loss went from 51.10 to 50.09, the validation loss from 64.20 to 63 (minor changes), and the WER became 0.2982, but the transcriptions did not improve.
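For reference, the normalization over the whole set was done along these lines (a sox equivalent of Audacity's Normalize effect; the -3 dB headroom is just an example value, not necessarily what matters here):

```python
# Sketch of batch peak normalization with sox: "gain -n <dB>" scales each file
# so its peak sits at the given level below 0 dBFS. The -3 dB value is an example.
import subprocess

def peak_normalize(src_wav, dst_wav, headroom_db=-3.0):
    subprocess.run(["sox", src_wav, dst_wav, "gain", "-n", str(headroom_db)],
                   check=True)

peak_normalize("chunk_0001.wav", "chunk_0001_norm.wav")
```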
My questions are:
- What was the reason behind the model’s behaviour?
- Why is the model's transcription better on normalized audio chunks? Do I need to make all training audio the same length?
- The model's accuracy degraded after I trained with the new data (300 hrs) converted from stereo to mono. Does stereo-to-mono conversion affect accuracy?
Any help is appreciated.
Thanks