Proposal: Sentence length limit from 14 words to 100 characters

Hi everyone,

I’ve been working with @josh_meyer from the Deep Speech team to determine the best length limit for sentences in the corpus.

The proposal is to move from the current 14 words to 100 characters.

(Note that this limitation can potentially be adapted in the future per language.)

We need to find a reasonable, language-independent way to flag sentences that are too long for Common Voice.

The problem we are addressing is the following: long sentences are harder for people to read, and they also make it harder to train good speech recognition models. So, if we can exclude sentences that are too long, both the volunteers on Common Voice and the engineers using the data will be happier!

Unfortunately, the idea of a sentence being “too long” is very hard to define. Some languages have sentences with long words but fewer words per sentence (e.g. Turkish), whereas other languages have sentences where a single character can be a word (e.g. Mandarin).

So, we want to find a good metric for flagging sentences that are too long. So far we’re looking at the sentences we’ve already collected and working from there.

We’ve found that across all the languages collecting audio data in Common Voice, the average sentence length is 47.2 characters. However, there’s a wide range of sentence lengths, from 1 up to 420 characters. Right now we think that 100 characters is a good, language-independent cut-off point.
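For anyone who wants to run the same kind of check on their own sentence list, here is a minimal sketch. It is not the script used for the numbers above: the function name `length_summary` and the sample sentences are made up for illustration, and `len()` counts Unicode code points, which may differ from what a reader perceives as characters in some scripts.

```python
from statistics import mean


def length_summary(sentences, max_chars=100):
    """Summarize character lengths and the share of sentences over max_chars."""
    lengths = [len(s) for s in sentences]  # code points, spaces and punctuation included
    n_too_long = sum(1 for n in lengths if n > max_chars)
    return {
        "mean": mean(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "share_too_long": n_too_long / len(lengths),
    }


# Hypothetical sample, not taken from the Common Voice corpus.
sample = ["Hello.", "This is a short sentence.", "x" * 120]
summary = length_summary(sample)
print(summary)
```

Running this per language instead of over the whole corpus would show where a 100-character cut-off removes an unusually large share of sentences.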

This topic will be open for feedback for 5 days (until February 6th). After that, we will analyze all the feedback, and Josh and I will make a final decision.

  • Are we missing something?
  • Does this proposal make sense?
  • If not, what’s your proposal and why?

Thanks!

With regard to length in time, we would like the read-aloud sentences to be around 5 to 10 seconds long, which works well for training good speech recognition.


I just made a test:

For the sentence: Rran-t d tilist i umezruy-nneɣ, s lqern wis sebɛa i yebda umezruy-nneɣ, akkin akk ulac kra i yellan.

  • Size: 100 chars (including every char: spaces, punctuation, separators, letters)
  • Recording time: 8 seconds
  • Reading speed: slow to normal.


For Kabyle: a max length of 80-90 characters (including every char) should be right.
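As a rough worked example of the test above, the sketch below counts the characters of the quoted sentence and derives a reading rate from the reported 8 seconds. Extrapolating from a single recording to the 5 to 10 second target mentioned earlier in the thread is only an assumption, and the helper name `chars_for` is made up for illustration.

```python
sentence = (
    "Rran-t d tilist i umezruy-nneɣ, s lqern wis sebɛa i yebda "
    "umezruy-nneɣ, akkin akk ulac kra i yellan."
)
recording_seconds = 8  # recording time reported in the test above

n_chars = len(sentence)  # every code point: letters, spaces, punctuation, hyphens
chars_per_second = n_chars / recording_seconds


def chars_for(seconds, rate=chars_per_second):
    """Estimate how many characters fit in `seconds` at a constant reading rate."""
    return seconds * rate


print(n_chars, chars_per_second)
print(chars_for(5), chars_for(10))
```

A faster or slower reader, or a language with longer words, would shift these estimates, so they are best treated as a starting point for per-language tests.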


If 100 chars turns out to be 8 seconds, then we can probably afford to have some more characters.

That’s for Kabyle, which has short words, a lot of spaces, and the char “-”! Words in French, for example, are longer than in Kabyle.


@nukeador — Is there a way to have a warning at 100 chars and a hard cut-off somewhere like 250 chars?

The way validation works is all or nothing: your sentence is either OK or it’s not. If the lower limit is not a hard cap, people will tend to use 250 as the norm, not as an exception, and I suspect 250 is probably too long.

As I said, we can probably evaluate this for languages where 100 is clearly too short.

Well, I think 96 would be good, although mapping the length distribution for each language would be ideal. The top one thousand English words have 4.5 letters on average, but I think most words have 8 or 9 characters, so a 100-character limit could mean 8 or 9 words per sentence, which I think is good enough.

I’d be tempted to go a little longer than 100 given that this is the absolute limit. Many of the sentences in Reuben’s first post above are well above that, so you’ll risk getting a load of one-clause sentences.

Would 120 work?

A slight modification of the proposal:
Reject a sentence if it consists of both more than 14 words and more than 100 characters.
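A minimal sketch of this combined rule, assuming words are separated by whitespace (which will not hold for languages such as Mandarin) and that both limits are exclusive. The function name `is_rejected` is hypothetical and is not part of the Common Voice tooling.

```python
def is_rejected(sentence, max_words=14, max_chars=100):
    """Reject only when the sentence exceeds BOTH the word and the character limit."""
    n_words = len(sentence.split())  # assumption: whitespace-delimited words
    n_chars = len(sentence)          # every character, spaces and punctuation included
    return n_words > max_words and n_chars > max_chars


many_short_words = " ".join(["a"] * 15)              # 15 words, 29 characters
few_long_words = " ".join(["abcdefghijkl"] * 10)     # 10 words, 129 characters
many_long_words = " ".join(["abcdefgh"] * 15)        # 15 words, 134 characters

print(is_rejected(many_short_words))
print(is_rejected(few_long_words))
print(is_rejected(many_long_words))
```

With this rule, any sentence written to respect the old 14-word limit still passes regardless of its character count, which is the point of the modification.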

The 14-word limit has been public since September 2018. I (and hopefully others as well) have written lots of sentences since then, paying close attention not to exceed it. I did not count characters, and I am a bit afraid that many of them might have slightly more than 100 characters.

Can you do a quick calculation on the 14-word sentences you have, to find out their average number of characters?

But I agree, we should make sure sentences submitted under the previous 14-word limit are not affected.

Okay, some math: I analyzed my current corpus of 17740 German sentences, which contains 279 sentences with exactly 14 words. 38 of these 279 sentences (14%) have more than 100 characters. Other statistics about the number of characters (still referring to 14-word sentences):

mean = 87.99
median = 87
min = 68
max = 127
standard deviation = 9.72
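In case others want to repeat this on their own corpus, here is a sketch of how such figures could be computed. It is not the script used for the numbers above: the function name `stats_for_word_count` and the sample sentences are hypothetical, words are assumed to be whitespace-delimited, and `stdev` is the sample standard deviation (the post does not say which variant was used).

```python
from statistics import mean, median, stdev


def stats_for_word_count(sentences, n_words=14, max_chars=100):
    """Character-length statistics for sentences with exactly n_words words."""
    lengths = [len(s) for s in sentences if len(s.split()) == n_words]
    over = sum(1 for n in lengths if n > max_chars)
    return {
        "count": len(lengths),
        "over_limit": over,
        "share_over_limit": over / len(lengths),
        "mean": mean(lengths),
        "median": median(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "stdev": stdev(lengths),  # sample standard deviation; needs >= 2 sentences
    }


# Hypothetical corpus: two 14-word sentences and one shorter one that is ignored.
corpus = [
    " ".join(["ab"] * 14),        # 14 words, 41 characters
    " ".join(["abcdefgh"] * 14),  # 14 words, 125 characters
    "only three words",
]
result = stats_for_word_count(corpus)
print(result)
```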
