Common Voice Project Update - November 27th 2019

Metrics Dashboard

  • There is currently a small group of contributors working on the Kibana dashboard to get it ready and formatted for broader use.
    • We are currently working through a few Kibana security issues and will be switching over to a self-hosted Kibana.
    • We will keep you up to date when we have a timeline for release.

Campaign Update

  • Launched on Monday November 18th with snippets in English, German, French, Spanish and Italian.
    • 15-20k daily clips!
    • This was the team’s first time including Italian, and we were so happy to see a large number of new community members contributing.
    • Full report here

Infrastructure

  • We’re going to stop all feature releases to work on the Common Voice Infrastructure for the next couple months. The full scope of this is still being discussed.
  • What does this mean and what things are we prioritizing?
    • Limiting site downtime!
    • Automating dataset releases!
    • Site accessibility!
    • And more
  • We are excited to give our engineers the time they need to make the site better for everyone and provide the community with the information they need to get the best datasets possible.

Email Newsletters

  • Email implementation
    • We would like to be able to offer localized emails to our contributors and are working to make that happen

Open Voice Data Challenge Pilot partner launch

  • The Common Voice team has launched a partner pilot to look at how competition and incentives help increase the quantity and quality of data received. To do this we worked with three partner companies (SAP, IBM and Lenovo) and a small number of new contributors from those companies. These contributors are currently in week two of a three-week challenge. Once we receive and analyze the results from the challenge, we will decide if it makes sense to roll out this initiative to a larger group.

Do you have more information on the plan for this? I’m concerned about file sizes as the dataset grows - English is already 30 GB at around 700 hours, so a 10k hour dataset will be over 400 GB. If you’ve already downloaded the dataset before, it’s annoying to redownload 400 GB just to get the few GBs that are new in the update.
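For reference, the estimate above is a straight linear extrapolation from the current English release. A quick sketch of the arithmetic, assuming size keeps scaling in proportion to recorded hours (the function name is just illustrative):

```python
def projected_size_gb(current_gb, current_hours, target_hours):
    """Linearly extrapolate dataset size from hours of audio.

    Assumes the average size per hour stays constant, which is only
    an approximation (encoding or clip length could change).
    """
    return current_gb / current_hours * target_hours


# 30 GB at roughly 700 hours, projected to 10,000 hours
print(round(projected_size_gb(30, 700, 10_000)))  # 429
```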

I think the best option is to have regularly scheduled large dataset releases (quarterly?), then once someone has the dataset they can run a script to download only files that are new. That way we could have nightly/weekly updates without it being a big strain on both users’ connections and Mozilla’s bandwidth bill.
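To make the idea concrete, here is a minimal sketch of the "only fetch what is new" step. It assumes each release would publish a manifest listing clip filenames and that clips are stored as individual files in a local directory; neither is something Common Voice provides today, and `clips_to_download` is a hypothetical helper, not an existing tool.

```python
from pathlib import Path


def clips_to_download(manifest_names, local_dir):
    """Return the clip names in the manifest that are not on disk yet.

    manifest_names: iterable of clip filenames from a (hypothetical)
                    release manifest.
    local_dir:      directory holding clips from an earlier download.
    """
    local_path = Path(local_dir)
    if local_path.is_dir():
        # Names of the files we already have locally
        already_have = {p.name for p in local_path.iterdir() if p.is_file()}
    else:
        # Nothing downloaded yet, so every clip is new
        already_have = set()
    # Set difference gives the clips that still need fetching
    return sorted(set(manifest_names) - already_have)
```

A real script would then fetch each returned name and probably verify checksums, but the set difference is the part that avoids re-downloading the whole archive.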


Currently we do not. The engineers are looking into the best way to format this for our systems. We have just started looking at the infrastructure and realistically we’re looking at early 2020 for a release. We’ll keep updating everyone with progress as we have more information.

stop all feature releases to work on the Common Voice Infrastructure for the next couple months

Would you like me to split transcriptions from tagging and open a new issue so https://github.com/mozilla/voice-web/issues/814 can be converted to tagging? Gregor had referred me from https://discourse.mozilla.org/t/can-db-vote-be-a-boolean-union-with-a-utf-8-string/45941. If you have any questions about the motivations, please ping me at https://community.almond.stanford.edu/t/pulseaudio-event-sequence/153/3

@reuben @Nathan regarding the relationship to DeepSpeech

the sum of the acoustic model logit values for each timestep/character that contributed to the creation of this transcription.

Instead of using the sum or confidence values without context, let’s train authentic context-aware models of listener intelligibility. Is providing language learners with contextual intelligibility awareness worth ~120 FTE hours of effort?

replied at Intelligibility remediation