Feedback: Enabling small languages

Hello everyone,

We regularly get pinged to enable new languages in Common Voice, both to have the site localized and to gather new sentences.

I want to bring this conversation to the community because, as we know, for a dataset to start being effective, Deep Speech needs at least 2,000 hours of validated voice and a minimum of 1,000 different speakers.

  • What should we do with languages where, because of their size, it is not realistic that they will get 1,000 speakers?
  • Is a smaller dataset still useful for other work not related to Deep Speech?

Thanks for your feedback! :slight_smile:

Smaller datasets can be more than useful for smaller domains.

I am developing a simple Welsh language voice assistant app for Android using CommonVoice data, DeepSpeech and a simple language model ("What is the weather? What is the news? Play me some music", etc.). CommonVoice data (and DeepSpeech) is invaluable for us to begin developing such software. In time, I hope the app can stimulate more people to contribute, and thus widen the number of commands and domains.
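
For anyone curious, here is roughly what the recognition side can look like with the DeepSpeech Python bindings (0.7+). This is a minimal sketch, not our actual app code: the model, scorer and audio file names are placeholders, and the scorer is assumed to be built from the small command vocabulary only.

```python
# Sketch: constrained-domain recognition with the DeepSpeech Python
# bindings (pip install deepspeech). All file names are placeholders.
import wave
import numpy as np
from deepspeech import Model

model = Model("cy_model.pbmm")                 # hypothetical Welsh acoustic model
model.enableExternalScorer("commands.scorer")  # language model built from the
                                               # small command vocabulary only

with wave.open("question.wav", "rb") as wav:   # expects 16 kHz, 16-bit mono
    audio = np.frombuffer(wav.readframes(wav.getnframes()), dtype=np.int16)

print(model.stt(audio))  # e.g. "beth yw'r tywydd" ("what is the weather")
```

Restricting the scorer to the command vocabulary is what makes a small dataset viable here: the recogniser only has to choose between a handful of likely sentences.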

CommonVoice gives a powerful message to users and developers in smaller language communities - that they are not excluded, at least by one "tech giant", by virtue of their size, from the new speech based web paradigm, and that everyone can make a contribution. I hope Mozilla engages more with these communities so that challenges and successes are shared.

It's great to see Mozilla continue its support for minority languages with its invitation to all languages to contribute to Common Voice. Not all languages are in a similar situation, owing to differences in population size and in commercial and governmental support, but I believe that all language communities deserve the ability to contribute towards ensuring their language is empowered by voice based technologies.

It may be that Deep Speech needs the data volumes and speaker counts you mention, but the technology is still developing, and technical and linguistic advances can make a big difference.

I hope we're also in a situation where we would not wish to close the door on minoritized languages due to technical considerations.

One way Mozilla could assist would be to collect experiences from successful data collection campaigns and share them so that we can all learn from their successes; see Kabyle and Catalan.

Peiying Mo has prepared a useful 'Guide to promoting Firefox in your language - a community marketing kit' for localizers of Mozilla products: https://mozilla-l10n.github.io/localizer-documentation/misc/community-marketing-kit.html . Something similar for Common Voice could be useful.

The public campaign in Wales to encourage contributors to Common Voice has raised Mozilla's profile higher than it has been for many years, which has been great for technology in Wales, the Welsh language and Mozilla. Long may it continue!

Rhoslyn Prys

@dewi.jones @rprys Thanks for your feedback!

To clarify: I think Common Voice should not only serve Deep Speech's needs but also other applications we haven't thought about; that's why having your experiences shared here is so important.

My goal is for the Common Voice community to be self-sustaining, so that it can provide value to different players, especially small players that are ignored by the tech giants :slight_smile:

I think if people are willing to contribute, why not let them?

The data might also be useful in situations where what is actually being said doesn't matter too much. For instance, I am considering starting a project that would require me to get noise print samples, so the language being spoken makes no difference to me.
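
To sketch the idea (purely illustrative, and it assumes the MP3 clips have already been converted to 16 kHz mono WAV):

```python
# Rough sketch: estimate an average magnitude spectrum ("noise print")
# from a folder of clips, ignoring entirely what is being said.
import glob
import wave
import numpy as np

spectra = []
for path in glob.glob("clips_wav/*.wav"):      # hypothetical folder of clips
    with wave.open(path, "rb") as wav:
        raw = wav.readframes(wav.getnframes())
    samples = np.frombuffer(raw, dtype=np.int16).astype(np.float32) / 32768.0
    n = (len(samples) // 1024) * 1024          # drop the ragged tail
    if n == 0:
        continue
    frames = samples[:n].reshape(-1, 1024)     # 1024-sample frames
    spectra.append(np.abs(np.fft.rfft(frames, axis=1)).mean(axis=0))

noise_print = np.mean(spectra, axis=0)         # average spectral profile
```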

Under the Common Voice umbrella there are actually two datasets being collected - text and voice.

I am not an expert on voice data, but a textual dataset, even raw text without any annotations, could be very helpful for applications like search, indexing, spellcheckers, dictionaries, etc.
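
For example, a first pass at a frequency dictionary could be as simple as this sketch (it assumes a validated.tsv with a sentence column, as in recent Common Voice releases):

```python
# Sketch: build a word-frequency list from the text side of a Common
# Voice release, as raw material for a spellchecker or dictionary.
import csv
import re
from collections import Counter

freq = Counter()
with open("validated.tsv", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        freq.update(re.findall(r"\w+", row["sentence"].lower()))

print(freq.most_common(20))  # frequent words seed a dictionary;
                             # rare ones are spellcheck candidates
```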

Many datasets on the market don't have big data either; most of the datasets I have seen have only a few dozen hours.

I believe that although we may not have a thousand hours for most of the languages, Common Voice data can still be very valuable. It's better to have some than none at all.

Many approaches are available for what the literature calls "low-resource languages", including transfer learning from models trained on high-resource languages. Every bit of data helps, and 1 hour is much better than 0. Also read about zero-resource learning, where there is actually no training on the target language until it's time to do recognition! In this challenging case you might start with just a target-language word list and a recognizer trained on another language (so not quite "zero"). As soon as you have even a small dataset with transcribed segments, as in Common Voice, you should be able to do much better.

I believe the Deep Speech model, unmodified, is simply one of the highest-performing architectures when you have lots and lots of data available, but there are hundreds of other architectures around.

In my opinion, what needs the most attention now is prompt design (i.e. the sentence collector). With big recurrent neural networks, I think it's really best to have very little repetition of prompts and of prompt wording. We also need to make sure we're getting speaker IDs right, so that model developers can strictly partition speakers into training and validation sets.
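
For instance, a deterministic hash of the client_id column in the release TSVs is enough to keep speakers disjoint. A sketch (nothing official, and the file name is a placeholder):

```python
# Sketch: speaker-disjoint train/validation split keyed on client_id,
# so no speaker's voice appears in both partitions.
import csv
import hashlib

train, dev = [], []
with open("validated.tsv", encoding="utf-8") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        # Same speaker always hashes to the same bucket.
        bucket = int(hashlib.sha1(row["client_id"].encode()).hexdigest(), 16) % 10
        (dev if bucket == 0 else train).append(row)

print(len(train), "training clips,", len(dev), "validation clips")
```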

" * Is a smaller dataset still useful for other work not related with Deep Speech?"

Very much so, if it is collected carefully, speaker metadata is recorded accurately, etc. Here's an example of an interesting (copyrighted) dataset, collected from audio bibles in 700 different languages and used to train the Festvox TTS (text to speech) system (I believe for all 700 languages): https://github.com/festvox/datasets-CMU_Wilderness . There's no way you would have seen TTS systems for so many languages without the data.