I have a question about how to prepare sentences that contain groups of uppercase letters standing for abbreviated terms. I'll give examples in English, although my target language is Basque, because I think the issue is similar in many languages.
Abbreviations like FBI, BBC, KGB… are spelled letter by letter. Should I leave them as they are in the text ("FBI")? Or should I separate the letters with spaces ("F B I"), so that later processing recognizes them as separate letters and expects speakers to spell them out in the recordings?
Abbreviations like NATO, UNESCO, NASA… are pronounced as words because of their syllable structure. Should I leave them as they are in the text ("NATO")? Or should I write them in lowercase ("nato") to distinguish them from the letter-by-letter abbreviations?
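To make the two options concrete, here is a minimal Python sketch of what such a rewriting step could look like. The two word lists and the function name `rewrite_abbreviations` are hypothetical, made up only for illustration; they are not part of any existing tool, and a real list would have to be built by hand for Basque. Whether this rewriting actually helps training is the open question of this post.

```python
import re

# Hypothetical, hand-maintained lists (assumption: one pair per language).
SPELLED = {"FBI", "BBC", "KGB"}          # read letter by letter
WORD_LIKE = {"NATO", "UNESCO", "NASA"}   # read as ordinary words

# Two or more consecutive uppercase ASCII letters forming a whole word.
ABBREVIATION = re.compile(r"\b[A-Z]{2,}\b")


def rewrite_abbreviations(sentence: str) -> str:
    """Space out spelled abbreviations and lowercase word-like ones."""

    def replace(match: re.Match) -> str:
        token = match.group(0)
        if token in SPELLED:
            return " ".join(token)   # "FBI" -> "F B I"
        if token in WORD_LIKE:
            return token.lower()     # "NATO" -> "nato"
        return token                 # unknown abbreviation: leave untouched

    return ABBREVIATION.sub(replace, sentence)


print(rewrite_abbreviations("The FBI and NATO met at the BBC."))
# prints: The F B I and nato met at the B B C.
```

Abbreviations that appear in neither list are left unchanged, so they would still need a manual decision.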
My understanding is that writing both types of abbreviations the way they are normally written will make the trained result poorer. But perhaps I'm wrong and nothing needs to be done.
That is my question. I would appreciate it if someone could shed some light on this subject; otherwise I'll simply avoid all abbreviations in the collected sentences. Unfortunately, I suspect that decision could also produce a worse trained system, but I know nothing about deep learning techniques, so I may be completely wrong.
What do you recommend I do with Basque abbreviations, both the ones spelled letter by letter and the ones pronounced as words?