I’ve trained on LJSpeech to confirm a working setup.
I’ve created a custom dataset of ~15K utterances. AnalyzeDataset looks good after I filtered out outliers (transcripts longer than 63 characters). CheckSpectrograms also checks out. My dataset is in LJSpeech format.
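For reference, the outlier filtering was along these lines. This is a minimal sketch, assuming pipe-separated LJSpeech-style metadata rows (`id|raw text|normalized text`); the row contents here are made-up examples.

```python
# Minimal sketch: drop outlier utterances whose normalized transcript
# exceeds 63 characters, given LJSpeech-style "id|raw|normalized" rows.
MAX_CHARS = 63  # length cutoff used when filtering outliers

def keep_row(line: str, max_chars: int = MAX_CHARS) -> bool:
    """Return True if the row's normalized transcript is within the cutoff."""
    parts = line.rstrip("\n").split("|")
    # Fall back to the last field if there is no separate normalized column.
    text = parts[2] if len(parts) > 2 else parts[-1]
    return len(text) <= max_chars

# Hypothetical example rows, not real dataset entries:
rows = [
    "LJ001-0001|short line|short line",
    "LJ001-0002|" + "x" * 80 + "|" + "x" * 80,  # over the cutoff, dropped
]
filtered = [r for r in rows if keep_row(r)]
print(len(filtered))  # prints 1
```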
I think the best way to summarize the performance is this: after 20K iterations, all four test files in test_audios/2XXXX are the same length, and they match the lengths of the first four test files in 1XXX. In contrast, training on LJSpeech quickly shows divergence across files and file lengths.
I’m a bit stumped, having carefully checked file formats, sampling rates, etc. I know my data isn’t completely clean, but I would have expected some result. FWIW, I’m using subtitle data for the character Cartman from South Park.
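In case it helps anyone reproduce my checks: this is roughly how I verified the sampling rates were uniform. A minimal stdlib-only sketch; the directory layout is assumed, and the demo wav it writes is synthetic.

```python
# Minimal sketch: collect the distinct sampling rates of all wavs in a
# directory, so a mismatch like {22050, 44100} is easy to spot.
import tempfile
import wave
from pathlib import Path

def sample_rates(wav_dir: str) -> set:
    """Return the set of distinct sampling rates found under wav_dir."""
    rates = set()
    for path in Path(wav_dir).glob("*.wav"):
        with wave.open(str(path), "rb") as w:
            rates.add(w.getframerate())
    return rates

# Self-contained demo: write one short silent 22050 Hz wav and scan it.
tmp = tempfile.mkdtemp()
with wave.open(str(Path(tmp) / "demo.wav"), "wb") as w:
    w.setnchannels(1)            # mono
    w.setsampwidth(2)            # 16-bit samples
    w.setframerate(22050)        # LJSpeech's usual rate
    w.writeframes(b"\x00\x00" * 100)
print(sample_rates(tmp))  # prints {22050}
```

If the returned set has more than one element, some files need resampling before training.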