First of all…
Cheers and thanks for this amazing project. I waited nearly four years for something like this.
I can’t explain how grateful I am…
After I had reached 12% in German, I was determined to reach similar results with other languages as well.
Long story short…
I built some Python scripts to collect, sort, and clean datasets for DeepSpeech.
You can prepare training data with just one command.
I tried to make it as user-friendly and convenient as possible.
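To give a rough idea of what a "clean and prepare" step can look like, here is a minimal sketch. It is not the actual scripts from this project. It assumes the CSV layout that DeepSpeech training expects (columns wav_filename, wav_filesize, transcript); the function names clean_transcript and write_deepspeech_csv and the ALLOWED alphabet are hypothetical and only for illustration.

```python
import csv
import os
import re
import unicodedata

# Hypothetical alphabet for illustration; a real one depends on the language.
ALLOWED = set("abcdefghijklmnopqrstuvwxyzäöüß' ")


def clean_transcript(text, allowed=ALLOWED):
    """Lowercase, replace characters outside the alphabet with spaces,
    and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text).lower()
    text = "".join(ch if ch in allowed else " " for ch in text)
    return re.sub(r"\s+", " ", text).strip()


def write_deepspeech_csv(samples, csv_path):
    """Write (wav_path, raw_transcript) pairs to a training CSV.

    Samples whose transcript is empty after cleaning are skipped.
    Returns the number of rows written (header excluded).
    """
    written = 0
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav_path, raw in samples:
            transcript = clean_transcript(raw)
            if not transcript:
                continue
            writer.writerow([wav_path, os.path.getsize(wav_path), transcript])
            written += 1
    return written
```

Replacing unknown characters with a space instead of deleting them keeps neighbouring words from being glued together; whether that is the right choice depends on the dataset.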
Any suggestions or questions are welcome. If you have any ideas for future features, let me know,
and please share your results and thoughts.
If you know of more datasets for the languages below, please share them with me.
I will integrate them as well.
Maybe you can write to your government if it is holding back data, as in the Netherlands. They are only damaging themselves…
Datasets so far:
Common Voice
VoxForge
LibriVox
Spoken Wiki (aligner is broken; will be fixed in the next version)
Tatoeba
Tuda
Zamia
Vystadial
African Accented French
Nicolas French
I won't put the download links for the Common Voice dataset in the database because of its terms and conditions (AGB).
However, I will add an option to insert the links yourself after you have accepted the terms and received the links.
Tests so far:
de = 9.84%
https://drive.google.com/open?id=1quyJ9cHX4f5wEg3K3QayEmgqlhoYUPUd
pl = 13.7%
https://drive.google.com/open?id=14oDu1Kes2I16ReBhCJpAFETHVRETlT0N
es = 13.9%
https://drive.google.com/open?id=1Yw5SUbIzKUqsEQCwP-eoaTW492QYc1Ol
it = 18.4%
https://drive.google.com/open?id=14l-jx56zM84EWpfhkYT0gHc9cZZ-Ti9D
fr = 22.7%
https://drive.google.com/open?id=1tHNM-7HnPQBdooVgTxNl6F-pgVbRMk3h
uk = 29.9%
https://drive.google.com/open?id=1dQ5MzlkhjdiQpLCJDNqV1-Z2GsquXIx1
ru = 36.9%
https://drive.google.com/open?id=1eBm2aD0QGh8y5LgZP0MYqZresdcVIvgz
nl = 39.6%
https://drive.google.com/open?id=1eP8ug3qTUwodI3uEaofJ5xqjhMfjWUs6
pt = 50.7%
https://drive.google.com/open?id=1QE7PIUnQXS6X_t90O8bTiupJu5a0kPf-
cs = not enough data
lt = not enough data
da = not enough data
et = not enough data
fi = not enough data
ro = not enough data
sq = not enough data
bg = not enough data
hr = not enough data
el = not enough data
ca = not enough data