Multi-language Dataset Beta Release

lsaunders · February 15, 2019, 5:47pm

The multi-language dataset is now available to the Common Voice community as a beta release! This release includes all of the new multi-language data collected in 2018.

There are two reasons for choosing a community-focused beta release. First, the data in this release is raw. The Common Voice team will continue improving the way the data is bundled across languages, but we also want to get the dataset in the hands of those who want to start using it immediately.

Second, before a wider release, we need the help and expertise of this community to make the data better for everyone. With your help, we are targeting a full release on the Common Voice site by the end of January.

Mozilla’s DeepSpeech team has created a CorporaCreator repository on GitHub with tools for processing the Common Voice dataset. To help clean the data you can either write or improve a preprocessor for a language (here is the one shared across languages) or you can post a comment about irregularities you may have noticed in the dataset. In particular, we are looking for irregularities like:

Numbers. There should be no digits in the source text because they can cause problems when read aloud. The way a number is read depends on context and might introduce confusion in the dataset. For example, the number “2409” could be accurately read as either “twenty-four zero nine” or “two thousand four hundred nine”.
Abbreviations and Acronyms. Abbreviations and acronyms like “USA” or “ICE” should be avoided in the source text because they may be read in a way that does not coincide with their spelling. Additionally, there may be multiple accurate readings for a single abbreviation. For example, the acronym “ICE” could be pronounced “I-C-E” or as a single word.
Punctuation. Special symbols and punctuation should only be included when absolutely necessary. For example, an apostrophe is included in English words like “don’t” and “we’re” and should be included in the source text, but it is unlikely you’ll ever need a special symbol like “@” or “#.”
Foreign letters. Letters must be valid in the language being spoken. For example, “ж” is a letter in the Russian alphabet but is never used in English and so should never appear in any English source text.
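
As a rough illustration of the four checks above, here is a minimal Python sketch for English sentences. It is not part of CorporaCreator; the function name `find_irregularities`, the `ALLOWED_PUNCTUATION` allow-list, and the "Latin script only" rule for foreign letters are all assumptions made for this example, and a real preprocessor would need per-language rules.

```python
import re
import unicodedata

# Two or more consecutive capitals as a standalone token, e.g. "USA" or "ICE".
ACRONYM_RE = re.compile(r"\b[A-Z]{2,}\b")

# Hypothetical allow-list of punctuation for English source text.
ALLOWED_PUNCTUATION = set(" .,?!'’-\"")


def find_irregularities(sentence):
    """Return the kinds of irregularity found in a sentence, without duplicates."""
    issues = []

    def flag(name):
        if name not in issues:
            issues.append(name)

    if ACRONYM_RE.search(sentence):
        flag("acronym")
    for ch in sentence:
        if ch.isdigit():
            flag("digits")
        elif ch.isalpha():
            # Assumption: for English, any letter outside the Latin script is foreign.
            if "LATIN" not in unicodedata.name(ch, ""):
                flag("foreign_letter")
        elif ch not in ALLOWED_PUNCTUATION:
            flag("symbol")
    return issues


if __name__ == "__main__":
    for text in ["I was born in 2409.", "The USA is large.", "don’t worry, we’re fine"]:
        print(text, "->", find_irregularities(text))
```

The acronym check is deliberately crude: it would also flag a sentence written entirely in capitals, so its hits are best treated as candidates for human review rather than automatic rejections.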

To get started, you will need to download the dataset’s clips.tsv file and follow the instructions in the included README. This will only give you access to the text data.

For access to the full dataset, including voice clip audio, you will need to fill out this form.

Reviewing and cleaning the Common Voice data will help everyone who uses it – from academics to small companies and all the makers who need CC0 data – to move forward with a voice-enabled project. The Common Voice team is committed to building a dataset of clean and stable data so we can practice appropriate version control and provide everyone with a way to recreate any testing they need to do in the future.

Thank you for being a part of this project!

belkacem77 · December 18, 2018, 9:31pm

Finally.
Thanks for release.

liordon · December 26, 2018, 12:02pm

What languages can we expect to be released?
And about those which won’t be - what’s missing for them?

Happy new year!

lsaunders · December 27, 2018, 7:37pm

We are going to release all of the languages that have data in them as of October 2018, which comes to 16 languages. You can see the full list of languages here: https://voice.mozilla.org/en/languages

lsaunders · January 11, 2019, 6:11pm

Hello Everyone, the new dataset is back up and ready for use!

entn-at · January 11, 2019, 7:36pm

Hi,
thanks for your hard work on making this release happen!
I imagine you’re getting tons of requests for access to the voice data right now. Is giving access a manual process (i.e., giving access only after review of each form submission by a human)?

lsaunders · January 11, 2019, 11:14pm

Hi there,

If you would like voice access you can fill out the form above and we will be sending a link via email at the beginning of each day. Each of the sentences you hear has been reviewed by 2 humans to ensure its correctness. Does that answer your question?

entn-at · January 12, 2019, 12:15am

Thanks, that fully answers my question. I filled out the form above earlier today (shortly after your post), but haven’t received any email, so I was just wondering if it had to be approved first. Again, thanks for your work on this and I’m looking forward to working with the data!

lsaunders · January 14, 2019, 5:48pm

You should be getting your email shortly!

areyliu6 · January 16, 2019, 6:22am

Hi @lsaunders,
I’ve found a broken audio file in the zh-TW dataset:
53777c75a47473ca6101ac395e74d3a8e9b66f2ad58ce3d7defc1a22761f5f0b7072ddf8d62fd06be02a4843587ea1322c29f90b61edf99cc608981306dc35e4.mp3
This audio is in other.tsv.

Finally,
thanks for release.

lsaunders · January 16, 2019, 3:49pm

Thanks for the heads up @areyliu6!
@gregor Do you need any further information about the break? Let’s review in the next sprint meeting.

belkacem77 · January 21, 2019, 8:42pm

I downloaded the kab dataset but I can’t find the transcripts (sentences). I only got the audio files.
I’m going to train on the first dataset using DeepSpeech so we can demo it at an event we are organizing to show the importance of the project and recruit more recorders from Kabylia.

Thanks again for the release.

gregor · January 22, 2019, 9:16am

The sentences are in the clips.tsv file. If you want to get them split up by language, validity & bucket, you need to run the CorporaCreator on the file.

belkacem77 · January 25, 2019, 10:26am

Thanks again for the wonderful help.

belkacem77 · January 26, 2019, 9:23am

But when I downloaded the audio files, there was no clips.tsv file. Is this file only downloadable separately?

irvin · January 26, 2019, 6:39pm

Question from some local community member about the data:

Are we only delivering the voices that have been verified multiple times on the site?
If so, what are the differences between the voices listed in valid.tsv and those in the other .tsv files (besides invalid)?

gregor · January 30, 2019, 10:47am

You can find that in the first post in this thread:

gregor · January 30, 2019, 10:49am

The clips.tsv file contains the number of votes each clip got, and the audio files are everything we have for this language up to this point (which for that release is 2018-12-19, I think). The valid.tsv file only contains clips which have at least 2 up-votes and more up-votes than down-votes.
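
The voting rule described above can be sketched in a few lines of Python. This is only an illustration, not the actual CorporaCreator code: the column names `up_votes` and `down_votes`, the helper names, and the sample rows are assumptions made for the example, so check the header of your own clips.tsv before reusing it.

```python
import csv
import io


def is_valid(up_votes, down_votes):
    """At least 2 up-votes and more up-votes than down-votes."""
    return up_votes >= 2 and up_votes > down_votes


def valid_rows(tsv_text, up_col="up_votes", down_col="down_votes"):
    """Return the rows of a tab-separated table that pass the voting rule.

    The column names are hypothetical defaults and may differ in the real file.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if is_valid(int(row[up_col]), int(row[down_col]))]


# Made-up sample data, for illustration only.
SAMPLE = (
    "path\tsentence\tup_votes\tdown_votes\n"
    "a.mp3\tfirst sentence\t2\t0\n"
    "b.mp3\tsecond sentence\t1\t0\n"
    "c.mp3\tthird sentence\t3\t3\n"
    "d.mp3\tfourth sentence\t3\t1\n"
)

if __name__ == "__main__":
    for row in valid_rows(SAMPLE):
        print(row["path"])
```

In the sample, `b.mp3` is excluded for having fewer than 2 up-votes and `c.mp3` for not having more up-votes than down-votes.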

belkacem77 · January 30, 2019, 10:57am

Thanks @lsaunders gweber

tomasland6 · March 1, 2019, 12:36pm

Hi there. I just want to say thank you for the really great job. We have been waiting for this release for so long, and you finally did it.