What’s the approach regarding spoken text normalisation now that sentences are coming directly from Wikipedia / public domain books (which would have the text in written rather than spoken form)?
I believe when Common Voice started, numbers were excluded or had to be written out (I might have this wrong, but I think that’s what I recall seeing). Is there a step in the import that normalises them, or do numeric and other normalisable things get excluded somehow? If there is something doing normalisation, how robust is it? (It seems quite a hard problem.)
If the sentences aren’t normalised then presumably they’ll challenge DeepSpeech (as what is actually spoken won’t correspond directly to the text), and there’s a chance that different people will normalise them differently if this isn’t done (e.g. 15.45 could be read as “fifteen forty-five” or “quarter to four”). Numbers are the biggest category, but there are plenty of other cases (e.g. abbreviations, acronyms vs. initialisms, etc.).
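To make the ambiguity concrete, here’s a toy sketch (my own illustration — nothing to do with any actual Common Voice or DeepSpeech import code) that enumerates plausible spoken readings of a single written token like “15.45”. The `cardinal` and `readings` functions are hypothetical names I’ve made up for the example:

```python
def cardinal(n):
    """Spell out an integer 0-99 in English, e.g. 45 -> 'forty-five'."""
    ones = ["zero", "one", "two", "three", "four", "five", "six", "seven",
            "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
            "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
    tens = ["", "", "twenty", "thirty", "forty", "fifty",
            "sixty", "seventy", "eighty", "ninety"]
    if n < 20:
        return ones[n]
    t, o = divmod(n, 10)
    return tens[t] + ("-" + ones[o] if o else "")

def readings(token):
    """Return several plausible spoken forms of a token like '15.45'."""
    out = []
    whole, _, frac = token.partition(".")
    h, m = int(whole), int(frac)
    # Decimal-number reading: "fifteen point four five"
    out.append(cardinal(h) + " point " + " ".join(cardinal(int(d)) for d in frac))
    # 24-hour clock reading: "fifteen forty-five"
    if h < 24 and m < 60:
        out.append(cardinal(h) + " " + cardinal(m))
        # Colloquial clock reading: "quarter to four"
        if m == 45:
            out.append("quarter to " + cardinal((h % 12) + 1))
    return out

print(readings("15.45"))
# → ['fifteen point four five', 'fifteen forty-five', 'quarter to four']
```

Even this tiny example produces three valid spoken forms for one written token — so if the import doesn’t normalise (or exclude) such tokens, different readers will inevitably diverge.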