Sentences that include groups of uppercase characters

txopi · January 9, 2019, 11:23am

I have a question about how to prepare sentences that include groups of uppercase characters used for abbreviated terms. I’ll give examples in English; my target language is Basque, but I think the issue is similar in many languages.

Abbreviations like FBI, BBC, KGB… are spelled out letter by letter. Should I leave them just as they are in the text ("FBI")? Or should I separate the letters with spaces ("F B I") so that post-processing recognizes them as separate letters and expects speakers to spell them out in the recordings?

Abbreviations like NATO, UNESCO, NASA… are pronounced as words because of their syllable structure. Should I leave them just as they are in the text ("NATO")? Or should I put them in lowercase ("nato") to distinguish them from the letter-by-letter abbreviations?

I assume that writing both types of abbreviation the way they are normally written will make the trained result poorer. But perhaps I’m wrong and nothing needs to be done.

That is my question. I would appreciate it if someone could shed some light on this subject; otherwise I’ll just avoid all abbreviations in the collected sentences. Unfortunately, I suspect that decision could also produce a worse trained system, but I have no knowledge of deep learning techniques and may be completely wrong.

What do you recommend I do with Basque letter-by-letter and syllabic abbreviations?

davidak · December 17, 2018, 5:31pm

I think abbreviations should be pronounced the way people normally pronounce them, since an STT engine should be able to handle that.

txopi · December 17, 2018, 5:31pm

Hi davidak. My question isn’t how a speaker should pronounce the words, but how to write them in the data set. Do you mean I should just write them as usual (NATO, BBC, UNESCO, FBI…) and the STT engine will do all the work by itself?
NOTE: There are also hybrid abbreviations like JPEG.

davidak · December 17, 2018, 5:31pm

I would also just write those as they occur in normal text.

nukeador · December 17, 2018, 5:31pm

@lsaunders and I will be meeting with the DeepSpeech team soon to talk about this “cleaning” of the sentences so they are fully useful for the engine. We will let you know once we have the details.

nukeador · January 9, 2019, 11:24am

The sentence collection tool’s how-to now has some guidance on requirements:

https://common-voice.github.io/sentence-collector/#/how-to

dabinat · January 9, 2019, 5:45pm

Here’s a question: if the engine is capable of recognizing the letters U, S and A individually, does it really need to be taught the acronym USA?

txopi · January 9, 2019, 9:36pm

As far as I know, the engine doesn’t work the way you describe. You can find this explanation in the how-to: Abbreviations and acronyms like “USA” or “ICE” should be avoided in the source text because they may be read in a way that does not coincide with their spelling. Additionally, there may be multiple accurate readings for a single abbreviation. For example, the acronym “ICE” could be pronounced “I-C-E” or as a single word.
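
For anyone who wants to apply that guidance to a list of candidate sentences, here is a minimal Python sketch that sets aside any sentence containing a group of two or more uppercase letters. It is only an illustration, not part of the sentence collector: the function names `find_uppercase_groups` and `filter_sentences` are made up for this example, and the "two or more uppercase letters" rule is an assumption about what counts as an abbreviation.

```python
import re


def find_uppercase_groups(sentence):
    """Return tokens made only of letters, all uppercase, at least 2 long.

    Hypothetical helper: the length threshold is an assumption, chosen so
    that single capital letters such as the English word "I" are not flagged.
    """
    tokens = re.findall(r"\w+", sentence)
    return [t for t in tokens if len(t) >= 2 and t.isalpha() and t.isupper()]


def filter_sentences(sentences):
    """Split sentences into (kept, rejected) based on uppercase groups."""
    kept, rejected = [], []
    for sentence in sentences:
        if find_uppercase_groups(sentence):
            rejected.append(sentence)
        else:
            kept.append(sentence)
    return kept, rejected


if __name__ == "__main__":
    candidates = [
        "The FBI and NATO issued a statement.",
        "I saved the picture as a JPEG file.",
        "She walked to the market in the morning.",
    ]
    kept, rejected = filter_sentences(candidates)
    print("kept:", kept)
    print("rejected:", rejected)
```

Rejected sentences could then be reviewed by hand and either dropped or rewritten without the abbreviation. The check does not tell letter-by-letter abbreviations (FBI) apart from word-like ones (NATO) or hybrids (JPEG); it only finds candidates.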