Questions about speech corpora for pre-trained model

I had some questions about the pre-trained model for 0.4.1.

  1. How many hours of data in total were used to train the pre-trained model?

  2. What are the proportions of each speech corpus used? For example, is it mainly LibriSpeech, Common Voice, or an even mix of all of them?

  3. It says that the model is optimized for American English, but also that it uses the English Common Voice corpus. Presumably that corpus isn’t filtered by accent first, and thus contains all English accents?


This info is also in our release notes here. In particular…

  1. 2000 (Fisher) + 260 (Switchboard) + 1000 (LibriSpeech) + approx. 600 (Common Voice) = approx. 3860 hours (the resulting proportions are sketched after this list)
  2. It uses all of the above corpora.
  3. Yes, it contains all accents, but so does American English itself. In particular, the training data is dominated by Fisher, which is in turn dominated by American English; see the links above.
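For what it’s worth, here is a quick back-of-the-envelope sketch of the corpus proportions implied by the hour counts above (the Common Voice figure is approximate, so treat the percentages as rough):

```python
# Rough corpus proportions, using the hour counts from the reply above.
hours = {
    "Fisher": 2000,
    "Switchboard": 260,
    "LibriSpeech": 1000,
    "Common Voice": 600,  # approximate
}

total = sum(hours.values())
print(f"Total: {total} hours")  # Total: 3860 hours

for corpus, h in hours.items():
    print(f"{corpus}: {h / total:.1%}")
# Fisher: 51.8%, Switchboard: 6.7%, LibriSpeech: 25.9%, Common Voice: 15.5%
```

So to answer question 2 directly: it is not an even mix; Fisher alone accounts for roughly half the training hours.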

Thanks! I asked because I did read the release notes before posting, and that information isn’t listed there; it would be great if it were added in future.