Discussion of new guidelines for recording validation

Michael_Maggs · November 19, 2019, 1:18pm

DRAFT GUIDELINES FOR REVIEWING RECORDINGS

[Edited to include comments up to 2 April 2019]

Problems with the recording

Misreadings

You need to check very carefully that what has been recorded is exactly what has been written - reject if there are even minor errors. Very common mistakes include:

Missing out ‘A’ or ‘The’ at the beginning of the recording.
Missing an ‘s’ at the end of a word.
Reading contractions that aren’t actually there, such as “We’re” instead of “We are”, or vice versa.
Missing the end of the last word by cutting off the recording too quickly.
Taking several attempts to read a word.

For example:

The giant dinosaurs of the Triassic.
Giant dinosaurs of the Triassic.
[‘The’ omitted]
The giant dinosaur of the Triassic.
[Should be ‘dinosaurs’]
The giant dinosaurs of the Triassi-.
[Recording cut off before the end of the last word]
The giant dinosaurs of the Triassic. Yes.
[More has been recorded than the required text]
The giant dinosaurs of the Tri- Triassic.
[The first ‘Tri-’ is not in the written text]
We are going out to get coffee.
We’re going out to get coffee.
[Should be “We are”]
We are going out to get a coffee.
[No ‘a’ in the original text]
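The kinds of misreading listed above can be illustrated with a word-level comparison. This is only a sketch: reviewers judge by ear, and the `heard` argument here is a hypothetical typed-out version of what the reviewer heard, not something the validation tool provides. The function names are invented for the example.

```python
import difflib
import re


def words(text):
    # Lowercase and normalise curly apostrophes, then keep letters, digits
    # and apostrophes so that "we're" and "we are" stay distinct.
    return re.findall(r"[a-z0-9']+", text.lower().replace("’", "'"))


def misreadings(written, heard):
    """Return (operation, written_words, heard_words) for every difference."""
    a, b = words(written), words(heard)
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [
        (tag, a[i1:i2], b[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    print(misreadings("The giant dinosaurs of the Triassic.",
                      "Giant dinosaurs of the Triassic."))
    print(misreadings("We are going out to get coffee.",
                      "We’re going out to get coffee."))
```

An empty result means the two word sequences match exactly; any other result corresponds to a clip that should be rejected under the rule above.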

Varying pronunciations

Be cautious before rejecting a clip on the ground that the reader has mispronounced a word, has put the stress in the wrong place, or has apparently ignored a question mark. There is a wide variety of pronunciations in use around the world, some of which you may not have heard in your local community. Please provide a margin of appreciation for those who may speak differently from you.

On the other hand, if you think that the reader has probably never come across the word before, and is simply making an incorrect guess at the pronunciation, please reject. If you are unsure, use the skip button.

On his head he wore a beret.
[‘Beret’ is OK whether with stress on the first syllable (UK) or the second (US)]
His hand was rais-ed.
[‘Raised’ in English is always pronounced as one syllable, not two]

Background noise

We want the machine learning algorithms to be able to handle a variety of background noise, and even relatively loud noises can be accepted provided that they don’t prevent you from hearing the entirety of the text. Quiet background music is OK; music loud enough to prevent you from hearing each and every word is not.

{Sneeze} The giant dinosaurs of the {cough} Triassic.
The giant dino {cough} the Triassic.
[Part of the text can’t be heard]

If the recording breaks up, or has crackles, reject unless the entirety of the text can still be heard:

{Crackle} giant dinosaurs of {crackle} -riassic.
[Part of the text can’t be heard]

Background voices

A quiet background hubbub is OK, but we don’t want additional voices that may cause a machine algorithm to identify words that are not in the written text. If you can hear distinct words apart from those of the text, the clip should be rejected. Typically this happens where the TV has been left on, or where there is a conversation going on nearby.

The giant dinosaurs of the Triassic. [read by one voice] Are you coming? [called by another]

Volume

There will be natural variations in volume between readers. Reject only if the volume is so high that the recording breaks up, or (more commonly) if it is so low that you can’t hear what is being said without reference to the written text.

Reader effects

Most recordings are of people talking in their natural voice. You can accept the occasional non-standard recording that is shouted, whispered, or obviously delivered in a ‘dramatic’ voice. Please reject sung recordings and those using a computer-synthesized voice.

Problems with the written text

Please see Discussion of new guidelines for uploaded sentence validation

Still unsure?

If you come across something that these guidelines don’t cover, please vote according to your best judgement. If you really can’t decide, use the skip button and go on to the next recording.

Michael_Maggs · March 3, 2019, 12:12pm

At present there is quite a large proportion of recordings that fall into the grey area where it’s unclear whether to answer yes or no. Perhaps we could use this thread to discuss and finalise something? I’ve posted a draft, above, to get us started.

dabinat · March 3, 2019, 5:35pm

This is really great, Michael. I agree with the points you have made. However, I’m not so sure about this one:

There are uses for CV and DeepSpeech beyond someone directly dictating to their computer. In my opinion, CV’s voice archive should contain as many different ways to say something as possible. Maybe singing is a stretch, but shouting, whispering or delivering the line in an overly theatrical way are ok in my opinion. In fact, some sentences even encourage the reader to whisper them.

Michael_Maggs · March 3, 2019, 9:14pm

You may well be right. I’d be interested to hear what the programmers’ expectations are.

nukeador · March 5, 2019, 12:12pm

Thanks for starting this @Michael_Maggs, it looks really good.

I will ping @kdavis and @josh_meyer for feedback on the ML expectations (in terms of what’s good/bad for deepspeech).

Also pinging @mbranson in terms of user experience, since I feel these rules are great but we need a way for people to be able to contribute without having to read a lot of instructions. Your post is super helpful to understand the problems we are encountering and we might have a way for the tool to help users with the process.

Cheers.

kdavis · March 6, 2019, 9:42am

Generally this is a good set of guidelines. Nice work @Michael_Maggs!

However, I note that we should clearly separate the guidelines for validating text from those for validating speech; the Problems with the written text section is geared more towards text.

One note on the “Ignore minor problems of punctuation if they don’t affect the recording” part…

The example ‘“the giant dinosaurs of the Triassic,’ is given as one in which the punctuation does not affect the reading. This is actually not quite the case.

Think of sentences which include commas. Generally when a comma is used correctly, it indicates a pause. A speech-to-text engine trained on text which uses commas correctly and which is read with the associated pause would learn to insert commas at the appropriate pauses.

However, if sentences similar to the above ‘“the giant dinosaurs of the Triassic,’ were used to train the system, it would never learn to insert commas in the correct place as there would be no correlation between commas and pauses. Sometimes commas would occur at the ends of sentences, sometimes at the start, sometimes randomly within sentences.

So it’s better to reject such sentences as they will cause the engine to have an invalid knowledge of commas.

nukeador · March 6, 2019, 11:31am

That’s good to know, we’ll make sure the sentence collector cleans up these orphan commas automatically.
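As a rough idea of what such a clean-up could look like (a minimal sketch, not the sentence collector’s actual code; the function name `clean_orphan_commas` is made up for illustration):

```python
import re


def clean_orphan_commas(sentence):
    """Remove commas that cannot correspond to a pause inside the sentence."""
    s = sentence.strip()
    s = re.sub(r"^(?:\s*,)+\s*", "", s)   # commas at the very start
    s = re.sub(r"(?:\s*,)+\s*$", "", s)   # commas at the very end
    s = re.sub(r"\s*,(?:\s*,)+", ",", s)  # runs of commas collapse to one
    return s


if __name__ == "__main__":
    print(clean_orphan_commas("the giant dinosaurs of the Triassic,"))
    print(clean_orphan_commas(", Well, are you coming?"))
```

Commas in the middle of a sentence are left alone, since those are the ones that can line up with a spoken pause.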

ajay.dixon · March 7, 2019, 7:44am

So we should be refusing voice samples that don’t pause for commas? That’s a lot of them.

There should be an information video for both submitting and validation that tells people this.

cjbaker · March 7, 2019, 2:49pm

So it’s better to reject such sentences as they will cause the engine to have an invalid knowledge of commas.

I don’t think it is common for speech recognizers to actually listen to speech timing (prosody) to determine punctuation; it is more commonly done as a post-processing step based solely on the text generated by the recognizer. If I’m not mistaken, the CTC used in Mozilla DeepSpeech excludes punctuation characters, and outputs all lower-case, unpunctuated text. Still, I think correctly read punctuation is better than incorrect, and maybe some future system will put it to use. I expect this project’s data to be used much more widely than just Mozilla DeepSpeech.

I wonder if we could add a set of checkboxes in the validation interface for some extra annotations: misread punctuation, incorrectly pronounced words, audio problems, etc. Maybe the current “thumbs up / down” could be the default, but a larger annotation interface could be a per-user option? A range of annotations, not just “good/bad”, can be useful for many purposes.
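To make the idea concrete, here is a minimal sketch of how a vote with optional annotations might be represented. Everything in it (`Issue`, `ClipReview`, `summarise`, the field names) is hypothetical and not part of the existing validation interface.

```python
from dataclasses import dataclass, field
from enum import Enum


class Issue(Enum):
    MISREAD_PUNCTUATION = "misread_punctuation"
    MISPRONOUNCED_WORD = "mispronounced_word"
    AUDIO_PROBLEM = "audio_problem"


@dataclass
class ClipReview:
    clip_id: str
    is_valid: bool  # the existing thumbs up / down vote
    issues: set = field(default_factory=set)  # optional extra annotations


def summarise(reviews):
    """Count how often each issue was flagged across a list of reviews."""
    counts = {issue: 0 for issue in Issue}
    for review in reviews:
        for issue in review.issues:
            counts[issue] += 1
    return counts


if __name__ == "__main__":
    reviews = [
        ClipReview("clip-1", True),
        ClipReview("clip-2", False, {Issue.AUDIO_PROBLEM}),
        ClipReview("clip-3", True, {Issue.MISREAD_PUNCTUATION}),
    ]
    print(summarise(reviews))
```

Because `issues` defaults to an empty set, a reviewer who only gives a thumbs up or down produces the same record as today.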

kdavis · March 7, 2019, 3:01pm

I agree, but we have to think of the future not the present when creating the data set.

In addition, keeping one limited application in mind (Deep Speech) is the wrong way to think about the data set. The data set should retain as much fidelity as possible and users of the data set can choose to not use this fidelity, e.g. throw away punctuation, if they decide to do so. Creating the data set with this in mind will give it maximal utility.
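For instance, a user of the data set who wants lower-case, unpunctuated English text could derive it from the full-fidelity sentences at load time. The sketch below assumes an alphabet of a–z, apostrophe and space; that choice and the name `to_training_text` are illustrative only, not taken from any particular engine.

```python
import re


def to_training_text(sentence):
    """Lowercase a sentence and drop all punctuation except apostrophes."""
    s = sentence.lower().replace("’", "'")  # normalise curly apostrophes
    s = re.sub(r"[^a-z' ]+", " ", s)        # anything outside the alphabet becomes a space
    return re.sub(r"\s+", " ", s).strip()   # tidy up repeated spaces


if __name__ == "__main__":
    original = "Well, are you coming?"
    print(original)                    # kept unchanged in the data set
    print(to_training_text(original))  # derived only when needed
```

The stored sentence is never modified, so a future system that does make use of punctuation can still work from the same data.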

Bloubi · March 13, 2019, 7:29am

Hi all, what about question marks and the like at the end of sentences? I often hear speakers not respecting them (for instance, for the text “you go first?” they do not raise the tone at the end of the recording, which makes it sound more like “you go first!”). Should I still validate the recording, or reject it?

Michael_Maggs · March 13, 2019, 3:58pm

@Bloubi I think you should accept that. There’s a lot of variety between different nationalities, and a rising tone does not always go with a question mark. Also, a sentence like “Well, are you coming?” said in a cross voice doesn’t have a rising tone.

Michael_Maggs · March 13, 2019, 3:55pm

I’ve edited the draft above to make it clear that we don’t want computer-synthesized voices. See What if people are using text-to-speech to record?

Michael_Maggs · March 13, 2019, 4:53pm

I’ve updated the draft at the top of this thread to include the various comments made so far. More feedback from the specialists would be welcome though. Pinging @nukeador, @kdavis, @josh_meyer, @mbranson, @gregor, @mkohler.

I agree that we don’t want to scare off new contributors by presenting the guidelines up-front as an off-putting wall of text that they have to read. A light-touch way to have them available would be to add to the recording validation page a short line such as:

Unsure? Click the Skip button, or read the guidelines here.

mbranson · March 13, 2019, 6:22pm

Thanks for all the great work here @Michael_Maggs. In terms of UX I agree with the gist of what you’ve said above. In addition to a call to action link from the Listen page of the contribution portal, we’d want to highlight guidelines on the Speak page too. Making contribution an informed process on both sides of a clip may be most beneficial.

This is content we would also want to make available via the FAQ and through our upcoming About page implementation (where we outline the overall recording and validation process).

In terms of longer-term UX goals, we’d ideally create an ‘onboarding flow’ where concepts like this are presented earlier to lower the barrier for new visitors to get started. This is precisely where something like a video could be very effective indeed. cc @ajay.dixon

@nukeador it’d be most helpful to work with @lsaunders to get an initial implementation of this guideline work into our sprints when ‘finalized’; e.g. a guideline page w/link from the contribution portal and FAQ.

dabinat · March 14, 2019, 3:56am

Something to add to the list would be ambiguous sentences.

An example from the current corpus would be:

I only read the quotations.

The sentence has no errors but it’s still unclear what the correct pronunciation is.

Michael_Maggs · March 14, 2019, 8:09am

@dabinat I would think that is an issue that the algorithm simply has to cope with. It’s an annoying feature of the English language that can’t be avoided.

Michael_Maggs · March 14, 2019, 8:13am

@mbranson Thanks for the feedback. If it would be useful I could work up some similar guidelines to be linked from the Speak page. They can be largely based on the same examples, but the focus would need to be slightly different.

Michael_Maggs · March 14, 2019, 11:04am

I’m wondering if we should tighten up the section of the guideline covering mispronunciations. When validating, I very frequently come across mispronunciations by non-native English speakers. Examples over the last half-hour include

‘bass drop’ (with a short ‘a’, as in the fish) [Bass drop]
‘Hewgee’ [Hughie]
‘jinny pigs’ [guinea pigs]
‘chemical’ (with the ‘ch’ pronounced as in chalk) [chemical]
‘contain-ed’ (three syllables) [contained]
‘It bet me’ [It beat me]
‘laat’ [laughed]
‘calorim-Ater’ [calorimeter]
‘knitting’ (with the k and n both pronounced) [knitting]

While I can’t be absolutely sure that these aren’t valid pronunciations that I’ve simply never heard before, it seems more likely that the reader is simply guessing.

How should these be handled? What do others do?

ajay.dixon · March 14, 2019, 12:04pm

I tend to click ‘no’ and move on for extremely mispronounced words. I’m of the opinion that soon enough, another speaker of the same nationality will submit a correct recording.

I would click ‘no’ for all of your examples, though the bass drop and calorimeter ones might be OK.

Any ambiguous pronunciation where I’m not sure, I click ‘skip’. That doesn’t happen very often though.