I have been comparing word counts/frequencies in the English dataset against other corpora, and I noticed that it is skewed in some ways. For example, I only found 4 unique sentences with curse words in them (all ended up in the test set). I also only found two unique sentences with the word ‘hello’ in them. In ‘train.tsv’, ‘hello’ appears once out of 128,004 words. That is roughly 7.81 instances per million, whereas common speech would be more like 104.11 per million. Roughly 8 per million is much closer to print than to speech, which is obviously not desirable for a speech dataset.
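For anyone who wants to check my numbers, here is a rough sketch of the counting, assuming the transcripts live in a ‘sentence’ column of the TSV (adjust the column name and the tokenizer to match the actual file):

```python
import csv
import re
from collections import Counter

def per_million(path: str, word: str, column: str = "sentence") -> float:
    """Return the frequency of `word` per million tokens in one TSV column."""
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            # Crude tokenizer: lowercase alphabetic runs (plus apostrophes).
            counts.update(re.findall(r"[a-z']+", row[column].lower()))
    total = sum(counts.values())
    return 1_000_000 * counts[word] / total

# 1 occurrence out of 128,004 tokens gives:
# 1_000_000 * 1 / 128_004 ≈ 7.81 per million
```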
I understand why this happens: it is hard to find usable sentences, and that limits which sources of speech are represented. I appreciate what it takes to gather sentences, and clearly there are efforts to improve the sentence collection. I think there is another way to, if not mitigate these distribution issues, at least allow a better understanding of the data. Has there been any thought to adding a tag to help identify the source of speech in an utterance? A small set of standardized source classes could help clarify what distribution this data comes from and possibly inform data collection in the future. I hope the utility of this information for studying speech recognition is obvious.
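To make that concrete, even one extra metadata column with a closed label set would do. The class names below are purely illustrative, not a proposal for the exact taxonomy:

```python
from enum import Enum

class SpeechSource(Enum):
    """Hypothetical closed set of source-of-speech labels for an utterance."""
    CONVERSATIONAL = "conversational"  # dialogue, chat-like speech
    NARRATIVE = "narrative"            # fiction, storytelling prose
    NEWS = "news"                      # journalistic text
    TECHNICAL = "technical"            # manuals, encyclopedic text
    OTHER = "other"

# train.tsv would then gain one column, e.g.:
# client_id    path    sentence                     speech_source
# ...          ...     Hello there, how are you?    conversational
```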
If this already exists somewhere or has already been discussed, I am sorry for re-hashing!
A good source of frequency data for comparison is COCA, which is where I drew the comparison frequency for ‘hello’: https://www.english-corpora.org/coca/ .