Hi everyone,
I’ve been working with @josh_meyer from the Deep Speech team to determine the best length limit for sentences in the corpus.
The proposal is to move from the current 14 words to 100 characters.
(Note that this limitation can potentially be adapted in the future per language.)
We need to find a reasonable, language-independent way to flag sentences that are too long for Common Voice.
The problem we are addressing is the following: long sentences are harder for people to read, and it is also harder to train good speech recognition models on them. So, if we can exclude sentences that are too long, both the volunteers on Common Voice and the engineers using the data will be happier!
Unfortunately, the idea of a sentence being “too long” is very hard to define. Some languages have sentences with long words but fewer words per sentence (e.g. Turkish), whereas other languages have sentences where a single character can be a word (e.g. Mandarin).
So, we want to find a good metric which we can use to flag sentences that are too long, and so far we’re looking at the sentences we’ve already collected, working from there.
We’ve found that, across all the languages collecting audio data in Common Voice, the average sentence length is 47.2 characters. However, there’s a wide range of sentence lengths, from 1 up to 420 characters. Right now we think that 100 characters is a good, language-independent cut-off point.
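To make the proposal concrete, here is a rough sketch in Python of what a character-based check could look like. This is only an illustration, not the code Common Voice actually runs: the function names are made up for this post, and it assumes that a “character” is a single Unicode code point and that leading and trailing whitespace does not count.

```python
MAX_CHARS = 100  # the cut-off proposed in this post


def char_length(sentence: str) -> int:
    """Length of a sentence in characters.

    Assumption: a "character" is one Unicode code point (what len() counts),
    and surrounding whitespace is ignored.
    """
    return len(sentence.strip())


def is_too_long(sentence: str, max_chars: int = MAX_CHARS) -> bool:
    """Flag a sentence whose character length exceeds the cut-off."""
    return char_length(sentence) > max_chars


def split_by_length(sentences, max_chars: int = MAX_CHARS):
    """Separate sentences into (kept, flagged) lists using the cut-off."""
    kept, flagged = [], []
    for sentence in sentences:
        target = flagged if is_too_long(sentence, max_chars) else kept
        target.append(sentence)
    return kept, flagged
```

With this sketch, a sentence of exactly 100 characters is kept and one of 101 characters is flagged, no matter how many words it contains. Note that counting code points may not match what a reader perceives as one character in scripts that use combining marks, so the exact definition of a character may need tuning per language.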
This topic will be open for feedback for 5 days (until February 6th). After that, we will analyze all the feedback, and Josh and I will make a final decision.
- Are we missing something?
- Does this proposal make sense?
- If not, what’s your proposal and why?
Thanks!