Training - Custom voice doesn't train

dubreuia · July 8, 2019, 11:52am

Hello,

I’m trying to train a custom voice with 24 hours of an audiobook recording and the associated transcription in LJSpeech format for the dataloader (I’ve posted a sample on my GitHub https://github.com/dubreuia/hosting/tree/master/mozilla-tts/custom-dataset-sample). I first used the AnalyzeDataset and CheckSpectrograms notebooks to check my data, and it looks good to me.

After 60K iterations (see screenshots), the model produces nothing usable: the audio is blank or faintly humming. Training up to 100K did not help. When I train on LJSpeech, I get good results at 100K, as expected.

It is probably similar to the thread “Custom voice - TTS not learning”.

Does someone have an idea why my dataset wouldn’t train?

Thanks,
Alex

erogol · July 8, 2019, 1:34pm

You are overfitting, as the plots show. Pay close attention to the difference between eval and train stats.

In case: https://en.wikipedia.org/wiki/Overfitting

How large is your dataset? Do you have any samples to share?

dubreuia · July 8, 2019, 1:52pm

Thank you @erogol for this project and for helping me out.

You are overfitting, as the plots show. Pay close attention to the difference between eval and train stats.

Yes, I can see my validation error going up while the training error goes down.
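For anyone reading along, the pattern described here (eval loss rising while train loss keeps falling) can be located in logged numbers rather than read off the plots. This is only a minimal sketch: `overfitting_onset` is a hypothetical helper, not part of the TTS repository, and it assumes you have exported the step numbers and both loss curves as plain Python lists of equal length.

```python
def overfitting_onset(steps, train_loss, eval_loss):
    """Return the step with the lowest eval loss if, by the last logged
    evaluation, eval loss has risen while train loss has kept falling.
    Return None when that pattern is not present."""
    # Index of the best (lowest) eval loss; the first one wins on ties.
    best = min(range(len(eval_loss)), key=eval_loss.__getitem__)
    if best == len(eval_loss) - 1:
        return None  # eval loss is still at its best on the last point
    train_still_falling = train_loss[-1] < train_loss[best]
    eval_rising = eval_loss[-1] > eval_loss[best]
    if train_still_falling and eval_rising:
        return steps[best]
    return None
```

A checkpoint saved near the returned step is usually a better candidate for listening tests than the final one, although with a dataset problem even that checkpoint may sound poor.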

How large is your dataset? Do you have any samples to share?

I’ve posted a small sample on my GitHub https://github.com/dubreuia/hosting/tree/master/mozilla-tts/custom-dataset-sample but I can post more. I have 22 hours of recordings, split into 14995 chunks, with lengths varying from 1 second to 10 seconds (most recordings are 4, 5, or 6 seconds long).
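The length distribution mentioned above can be recomputed directly from the files. The following is a rough sketch using only the standard library; the function names and the directory argument are hypothetical, and it assumes the chunks are uncompressed PCM WAV files that Python's `wave` module can open.

```python
import wave
from collections import Counter
from pathlib import Path


def clip_durations(wav_dir):
    """Map each WAV file name in wav_dir to its duration in seconds."""
    durations = {}
    for path in sorted(Path(wav_dir).glob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            durations[path.name] = wav.getnframes() / wav.getframerate()
    return durations


def duration_histogram(durations):
    """Count clips per whole-second bucket (a 4.7 s clip goes in bucket 4)."""
    return Counter(int(seconds) for seconds in durations.values())
```

Summing the values of `clip_durations(...)` also gives the total amount of audio, which is a quick way to confirm the number of hours actually present after splitting.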

I’m thinking maybe the sample rate is wrong (because I split the audio from an mp3 file, then converted it to wav), but I’ve used both notebooks to tune my config so I’m pretty sure it is good.

erogol · July 8, 2019, 2:06pm

You can check the sample rate with the soxi command. Make sure it matches the value in config.json.
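If running soxi over roughly 15,000 files is tedious, the same check can be scripted. This is a sketch under assumptions, not project code: `find_sample_rate_mismatches` is a hypothetical helper, the files are assumed to be PCM WAV readable by Python's `wave` module, and the expected rate is passed in by hand (copy it from your own config.json).

```python
import wave
from pathlib import Path


def find_sample_rate_mismatches(wav_dir, expected_sr):
    """Return (file name, sample rate) pairs for every WAV file in wav_dir
    whose sample rate differs from expected_sr."""
    mismatches = []
    for path in sorted(Path(wav_dir).glob("*.wav")):
        with wave.open(str(path), "rb") as wav:
            sample_rate = wav.getframerate()
        if sample_rate != expected_sr:
            mismatches.append((path.name, sample_rate))
    return mismatches
```

For example, `find_sample_rate_mismatches("wavs", 22050)` would list every file in a hypothetical `wavs` directory that is not at 22050 Hz, the rate LJSpeech uses. An empty list means the files and the stated rate agree.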

However, the specs sound good for your dataset.

You can also use the dataset analysis folder and the notebooks there to learn more about your dataset.

I also see some premature endings in some of the samples you shared.

In general, it is data quality that hinders TTS. Just keep iterating on your dataset until it works.

dubreuia · July 8, 2019, 5:07pm

I also see some premature endings in some of the samples you shared.

Yes, I’ll try splitting the audio on pauses instead of by word count, as I’m doing right now. Do you think having no punctuation plays a role?
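To make the "split on pauses" idea concrete, here is a deliberately crude sketch of one way to find cut points. It is not what the dataset was built with: `find_pause_midpoints` is a hypothetical helper, the samples are assumed to be floats in [-1, 1], and the threshold and minimum pause length are placeholder values that would need tuning per recording. A real pipeline would more likely use frame energy than single-sample amplitude.

```python
def find_pause_midpoints(samples, sample_rate, threshold=0.01, min_pause_s=0.3):
    """Return sample indices in the middle of each quiet run that lasts at
    least min_pause_s seconds and sits between two stretches of sound.

    Leading and trailing silence are ignored because they do not separate
    two utterances."""
    min_len = int(min_pause_s * sample_rate)
    midpoints = []
    run_start = None  # index where the current quiet run began
    for i, sample in enumerate(samples):
        if abs(sample) < threshold:
            if run_start is None:
                run_start = i
        else:
            long_enough = run_start is not None and i - run_start >= min_len
            if long_enough and run_start > 0:
                midpoints.append((run_start + i) // 2)
            run_start = None
    return midpoints
```

Cutting in the middle of a pause leaves a little silence on both sides of each chunk, which should avoid the clipped endings mentioned earlier in the thread.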

In general, it is data quality that hinders TTS. Just keep iterating on your dataset until it works.

Do you think the audio quality is sufficient? My audio has a lot less dynamics than LJSpeech, and the speaker is a lot more monotone.

Thanks,

erogol · July 9, 2019, 4:37pm

Yes, it prevents attention from working and learning properly.

It sounds good, but you need to check and quantify it, as I said.
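One possible way to put a number on "less dynamics" is to measure how much the loudness varies within a clip and compare that against LJSpeech clips. This is only a sketch of one such measure, not the analysis from the notebooks: both function names are hypothetical, samples are assumed to be floats in [-1, 1], and the frame length and the -80 dB floor for silent frames are arbitrary choices.

```python
import math
import statistics


def frame_levels_db(samples, frame_len, floor_db=-80.0):
    """RMS level in dB (relative to full scale) of each non-overlapping frame.
    Frames quieter than floor_db, including all-zero frames, are clamped."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / frame_len)
        level = 20 * math.log10(rms) if rms > 0 else floor_db
        levels.append(max(level, floor_db))
    return levels


def level_spread_db(samples, frame_len):
    """Standard deviation of the frame levels: a rough proxy for dynamics.
    Needs at least one full frame, otherwise statistics.pstdev raises."""
    return statistics.pstdev(frame_levels_db(samples, frame_len))
```

A consistently smaller spread on the custom clips than on LJSpeech clips would support the impression of flatter delivery. It says nothing about pitch, so it does not capture a monotone voice on its own.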