I have been comparing word counts/frequencies in the English dataset against other corpora, and I noticed that it is skewed in some ways. For example, I only found 4 unique sentences with curse words in them (all ended up in the test set). I also only found two unique sentences with the word ‘hello’ in them. In ‘train.tsv’, ‘hello’ appears once out of 128,004 words. That is roughly 7.81 instances per million, whereas common speech would be more like 104.11 per million. Roughly 8 per million is much closer to print than to speech, which is obviously not desirable for a speech dataset.
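For anyone who wants to check my numbers, here is a rough sketch of the counting, assuming the transcripts live in a ‘sentence’ column of the TSV (adjust the column name and the tokenizer to match the actual file):

```python
import csv
import re
from collections import Counter

def per_million(path: str, word: str, column: str = "sentence") -> float:
    """Return the frequency of `word` per million tokens in one TSV column."""
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            # Crude tokenizer: lowercase alphabetic runs (plus apostrophes).
            counts.update(re.findall(r"[a-z']+", row[column].lower()))
    total = sum(counts.values())
    return 1_000_000 * counts[word] / total

# 1 occurrence out of 128,004 tokens gives:
# 1_000_000 * 1 / 128_004 ≈ 7.81 per million
```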
I understand why this happens: it is hard to find usable sentences, and that limits which sources of speech are represented. I appreciate what it takes to gather sentences, and clearly there are efforts to improve the sentence collection. I think there is another way to, if not mitigate these distribution issues, at least allow a better understanding of the data. Has there been any thought to adding a tag to help identify the source of speech in an utterance? A small set of standardized source classes could help clarify what distribution this data comes from and possibly inform data collection in the future. I hope the utility of this information for studying speech recognition is obvious.
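To make that concrete, even one extra metadata column with a closed label set would do. The class names below are purely illustrative, not a proposal for the exact taxonomy:

```python
from enum import Enum

class SpeechSource(Enum):
    """Hypothetical closed set of source-of-speech labels for an utterance."""
    CONVERSATIONAL = "conversational"  # dialogue, chat-like speech
    NARRATIVE = "narrative"            # fiction, storytelling prose
    NEWS = "news"                      # journalistic text
    TECHNICAL = "technical"            # manuals, encyclopedic text
    OTHER = "other"

# train.tsv would then gain one column, e.g.:
# client_id    path    sentence                     speech_source
# ...          ...     Hello there, how are you?    conversational
```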
If this already exists somewhere or has already been discussed, I am sorry for re-hashing!
A good source of frequency data for comparison is COCA, which is where I drew the comparison frequency for ‘hello’: https://www.english-corpora.org/coca/ .