Concerns about Brazilian Portuguese and Dialects of other languages

Hi folks!

Common Voice is an awesome project. Thank you very much for that!

I opened an issue on GitHub, but I believe Discourse is the right place for this kind of discussion.

I have noticed this recent change: https://github.com/mozilla/voice-web/issues/1962

I believe that this could lead to a lot of adverse side effects. My main concerns:

  • Depriving countries of this technology

  • Loss of new contributors

  • Loss of the contributions made so far

I’m also concerned about whether this is happening to other languages.

These concerns are my personal point of view. I cannot speak for all Brazilians and would like to hear other opinions.

Depriving countries of this technology

Brazilian Portuguese and Portuguese from Portugal are profoundly different. I am Brazilian and have talked to other people in Brazil about it. If something is translated into or spoken in Portuguese from Portugal, it has not been translated into Brazilian Portuguese.

I would love, for example, to build something with open source projects like Mycroft or Leon in combination with Arduino for my family. I could automate something for my mom’s house, but she doesn’t speak English and would not use something in Portuguese from Portugal, because it is hard to understand and will never feel like something built for Brazilians. I have built some experiments and toys with Arduino with my 10-year-old son, and the same applies here; Portuguese from Portugal doesn’t make sense to him.

They feel that they can’t have access to this kind of technology, that it’s something just for other countries, and they see Portuguese from Portugal as a bad workaround that is not worth using.

Loss of new contributors

If someone from Brazil wants to contribute to the project and only sees PT, it does not make sense to contribute. It’s not something we can use, because it’s not our language, and, looking ahead, we would end up disturbing the Portuguese from Portugal. A lot of “correct” spellings will be marked as wrong and vice versa. And honestly, if we do not see pt-br, it’s better to accept that our language is not supported and to use the English version of something.

I ran a personal experiment, talking with some Brazilian friends about contributing to this project, and all of them declined because it does not make sense to them to contribute to Portuguese from Portugal. To avoid bias, I tried several times to advocate for it, saying: “well, to the computer it’s the same thing, we have different accents in Rio de Janeiro, Porto Alegre, São Paulo, Recife… Portuguese from Portugal is just another accent for the computer…”. They just don’t buy it. A Brazilian friend of mine who has traveled to Portugal says that it took him two days to start understanding the signs and the spelling.

It’s sad to lose the potential contributions of an entire country.

Loss of the contributions made so far

We have a lot of contributions. If people from Portugal start correcting the Brazilians’ spelling, a lot of things will be flagged as wrong. If Brazilians start checking the Portuguese spelling, the same applies. We will end up with a “kind of global Portuguese” that is not suitable for any country at the end of the day.

If there is some technical issue, then to avoid losing the contributions, maybe it’s better to temporarily shut down the Portuguese page until we can find an option other than unifying the languages.

Finally

Is there something I could do to help you folks with that? Do my arguments make sense? Is there some technical context for this change that I am missing?

I would love to discuss this further to get a better solution for the community.

Hi,

Thanks for your feedback. The question about languages and accents has been discussed many times already, specifically for Portuguese.

We have done research and talked with linguists and native speakers, and the decision we made was to avoid splitting the Portuguese text corpus (or that of any other language) into separate ones by country, and instead to identify differences in the voice corpus.

First, there is the technical limitation: it’s already a huge challenge to get at least 2 million sentences in a language, so imagine if we needed 2M for each country where a language is spoken.

For this project’s needs, we consider a dataset language to be a common writing system that shares the same words, grammar and script, acknowledging that informal expressions can vary between territories.

Spanish, for example, has far more local variants and accents, and this hasn’t resulted in any issues; people understand that from time to time they might see a sentence with a local expression.

Second, about voice recordings: ideally, in the future, once we have implemented the new accents strategy we are working on, we could potentially let people validate only voices from their region (if we have enough of them), but unfortunately we are still far from that ideal scenario.

Having said that, I understand that some people feel more or less strongly about this, but we need to take into consideration the data we have at our disposal, our goals and our limitations.

The current way we are collecting data (once we improve how we identify different accents) will allow training STT models to understand Portuguese from both Brazil and Portugal, because we are capturing that metadata on the voices. We are aligned there.
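To illustrate the idea of one shared text corpus with accent metadata attached to each recording, here is a minimal sketch in Python. The record fields (`path`, `sentence`, `accent`), the accent labels and the helper `group_by_accent` are all hypothetical, chosen only for illustration; this is not the actual Common Voice schema or tooling.

```python
from collections import defaultdict

# Hypothetical clip records. The field names and accent labels are
# assumptions made for this sketch, not the real dataset format.
clips = [
    {"path": "clip_001.mp3", "sentence": "Bom dia, tudo bem?", "accent": "brazil"},
    {"path": "clip_002.mp3", "sentence": "Bom dia, tudo bem?", "accent": "portugal"},
    {"path": "clip_003.mp3", "sentence": "Onde fica a estação?", "accent": "brazil"},
    {"path": "clip_004.mp3", "sentence": "Onde fica a estação?", "accent": ""},
]


def group_by_accent(records):
    """Group voice clips by the accent their speaker reported."""
    groups = defaultdict(list)
    for record in records:
        # Clips without accent metadata go into their own bucket.
        groups[record["accent"] or "unspecified"].append(record)
    return dict(groups)


by_accent = group_by_accent(clips)

# The sentences form a single shared text corpus, regardless of accent.
shared_text = {clip["sentence"] for clip in clips}

for accent, records in by_accent.items():
    print(accent, len(records))
```

With this kind of layout, the same sentences can be recorded by speakers from different regions, and a model could be trained on all clips together or on a per-accent subset, depending on how much data each group has.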

Thanks again!

Hi! Thank you very much for the explanation. It makes everything very clear and makes perfect sense!

One thing I realized isn’t very clear to people is that the Common Voice project is about speech-to-text, not text-to-speech. When I explain this (and now that I realize it myself), people agree that being able to understand different dialects of Portuguese and transform them into text is a good thing!

Well, I’ll add my two cents here. @gbaptista, I hear your concerns. The accent strategy is something we need to take into consideration, but at least for me it’s very hard to tell whether a sentence is in Pt-Pt or Pt-Br. I could argue that so far the majority of speakers are from Brazil, which makes total sense, since Brazil has over 20x more people than Portugal. One thing to note is that on the contribute page you don’t specify the dialect; the contributor has this metadata on their profile, so just adding an option there for Portuguese would solve this issue for now. With regard to the dataset, I think having a common language with different dialects will be far more important for countries that speak Portuguese but have far fewer native speakers. Adding some regional metadata about accents in Brazil could also help improve the dataset overall. What do you think?