Mozilla Speech Service

It’s apparent that voice recognition will play an important role in CD products. The Link team is now considering speech-based products as well.

Does Mozilla leadership support consolidating our speech efforts into a non-productized service? Has the Vaani team had any discussions to this effect?

If we were not concerned about privacy, we could plug in a third-party speech-recognition API and the CD trains would be able to quickly produce a number of proposed products.

It’s a no-brainer that extracting our own voice resources into a compartmentalized service will be strategically crucial for Mozilla. (In the beginning, that might be as simple as Mozilla maintaining a Kaldi cluster with the appropriate client libraries.)
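To make the idea concrete, here is a hypothetical sketch of what a thin client library for such a Mozilla-hosted recognition service might look like. The endpoint URL, route, query parameter, and JSON wire format are all made up for illustration; nothing here reflects an actual Vaani or Kaldi API.

```python
# Hypothetical client sketch for a Mozilla-hosted speech service.
# The service URL and response schema below are invented examples.
import json
import urllib.request

SERVICE_URL = "https://speech.example.mozilla.org/v1/recognize"  # hypothetical


def build_request(audio: bytes, language: str = "en-US") -> urllib.request.Request:
    """Package raw audio into an HTTP POST for the (hypothetical) service."""
    return urllib.request.Request(
        SERVICE_URL + "?lang=" + language,
        data=audio,
        headers={"Content-Type": "audio/wav"},
        method="POST",
    )


def recognize(audio: bytes, language: str = "en-US") -> str:
    """Send audio to the service and return the transcript it claims to hear."""
    with urllib.request.urlopen(build_request(audio, language)) as resp:
        return json.load(resp)["transcript"]  # assumed response field
```

The point is that product trains would only ever see a call like `recognize(audio)`, while the cluster behind it could be swapped or scaled independently.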

Is there any buy-in for this idea? Any opposition, other than the current leadership directive to “build products for consumers, not developers”?


I’m not sure who you expect to get buy-in from, Marcus, but I’ve been trying to get the official green light to re-use Vaani results in Link for a while…

I started to hack on pocketsphinx + Rust with André from the Vaani team to see how suitable that could be for Link.

The code is at https://github.com/fabricedesre/rust-voice/tree/master and currently recognizes just two commands (see https://github.com/fabricedesre/rust-voice/blob/master/model/grammar.jsgf).

From Ari and the CD leadership team: acknowledgement that a dedicated, non-product-train speech-recognition team would be valuable. And from the Vaani team, who might have different opinions or levels of interest in a dedicated team.

I understand the desire to call PocketSphinx from Rust, but its utility seems limited to keyword detection and constrained demos. I get that it’s in the spirit of an MVP as an interim solution, but I’m personally more interested in how we approach speech more broadly.

All the voice recognition systems work in two stages: an “always on”, offline keyword-spotting stage that can be implemented with PocketSphinx (and could also handle a limited number of commands), and, once the magic keyword has been detected, a switch to a more powerful online engine. As far as I know, this is what Vaani has implemented with PocketSphinx + Kaldi.
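The two-stage flow described above can be sketched as a simple state machine: stay in the cheap, always-on spotting stage until the wake word fires, then hand the next audio off to the expensive online recognizer and drop back. The spotter and recognizer below are stand-in stubs, not real PocketSphinx or Kaldi bindings, and the wake word is invented for the example.

```python
# Minimal sketch of the two-stage pipeline: offline keyword spotting
# (stage 1) gating a more powerful online recognizer (stage 2).
# Both stages are stubs standing in for PocketSphinx / Kaldi.

def spot_keyword(audio_frame: bytes) -> bool:
    """Stage 1 stub: cheap, offline wake-word detection."""
    return audio_frame == b"hey-vaani"  # hypothetical wake word


def recognize_online(audio: bytes) -> str:
    """Stage 2 stub: full transcription by a server-side engine."""
    return audio.decode("ascii")


def process_stream(frames):
    """Stay in stage 1 until the keyword fires, then transcribe what follows."""
    transcripts = []
    armed = False
    for frame in frames:
        if not armed:
            armed = spot_keyword(frame)  # always-on, low-power path
        else:
            transcripts.append(recognize_online(frame))  # expensive path
            armed = False  # drop back to keyword spotting
    return transcripts
```

The design point is that the network-backed engine is only ever invoked after a local match, which is what makes the “always on” part tolerable for both power and privacy.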

I agree that there could be value in Mozilla offering/hosting a performant, multi-lingual speech to text service. Many commercial solutions are very bad outside of English, and it’s worrying that the data for these models is once again going into closed silos.

Well said, Fabrice. I wholeheartedly agree that there is great value in offering/hosting a performant, multi-lingual speech-to-text service and opening up the training data and models the service is based upon.

To that end we, the Vaani team, are creating an app (not yet released) that allows members of the community to contribute speech samples to an open speech corpus. We will then use this open speech data to create open models that work with open software.

It’s something we have been working on for some time; however, there are many legal and technical details that need to be worked through before it is ready for public consumption.

The key word in that sentence is “were” :slightly_smiling: - we are concerned about privacy, so we absolutely can’t and don’t want to white-label any black-box cloud APIs.

A Mozilla-hosted server running open source software (and publishing its language data to the public domain for self-hosters to reuse) is of course much more acceptable, though.