What’s the approach regarding spoken text normalisation now that sentences are coming directly from Wikipedia / public domain books (which would have the text in written rather than spoken form)?
I believe when Common Voice started, numbers were excluded or had to be written out (I might have this wrong, but I think that’s what I recall seeing). Is there a step in the import that normalises them, or do numeric and other normalisable things get excluded somehow? If there is something doing normalisation, how robust is it? (It seems quite a hard problem.)
If the sentences aren’t normalised then presumably they’ll challenge DeepSpeech (as what is actually spoken won’t correspond directly to the text), and there’s a chance that different people will normalise them differently if this isn’t done (e.g. 15.45 could be read as “fifteen forty-five” or “quarter to four”). Numbers are the biggest category, but there are plenty of other cases (e.g. abbreviations, acronyms vs. initialisms, etc.).
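To make the ambiguity concrete, here’s a toy sketch (my own illustration — nothing to do with any actual Common Voice or DeepSpeech import code) that enumerates plausible spoken readings of a single written token like “15.45”. The `cardinal` and `readings` functions are hypothetical names I’ve made up for the example:

```python
def cardinal(n):
    """Spell out an integer 0-99 in English, e.g. 45 -> 'forty-five'."""
    ones = ["zero", "one", "two", "three", "four", "five", "six", "seven",
            "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
            "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
    tens = ["", "", "twenty", "thirty", "forty", "fifty",
            "sixty", "seventy", "eighty", "ninety"]
    if n < 20:
        return ones[n]
    t, o = divmod(n, 10)
    return tens[t] + ("-" + ones[o] if o else "")

def readings(token):
    """Return several plausible spoken forms of a token like '15.45'."""
    out = []
    whole, _, frac = token.partition(".")
    h, m = int(whole), int(frac)
    # Decimal-number reading: "fifteen point four five"
    out.append(cardinal(h) + " point " + " ".join(cardinal(int(d)) for d in frac))
    # 24-hour clock reading: "fifteen forty-five"
    if h < 24 and m < 60:
        out.append(cardinal(h) + " " + cardinal(m))
        # Colloquial clock reading: "quarter to four"
        if m == 45:
            out.append("quarter to " + cardinal((h % 12) + 1))
    return out

print(readings("15.45"))
# → ['fifteen point four five', 'fifteen forty-five', 'quarter to four']
```

Even this tiny example produces three valid spoken forms for one written token — so if the import doesn’t normalise (or exclude) such tokens, different readers will inevitably diverge.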