Hello everyone,
I would like to open this topic to collect feedback from all our communities and partners. This topic contains a proposal crafted by the Common Voice and Deep Speech staff teams based on conversations with different volunteers and linguistics experts.
We will keep this topic open for feedback until May 26th. After that the same team will gather the input and create a final version.
We will seek consensus and agreement from most people involved, but the ultimate decision maker will be George Roter (@george).
What we want to know from you:
- Does the proposal resonate with you or your language?
- Do you have any flags?
- If so, what's the issue and why is it important?
Thanks for your comments!
Context and background
Languages
We realize the way Mozilla has historically identified languages/locales with variants might not always be useful for Common Voice and Deep Speech goals.
We consider a language to be a common writing system with the same words and grammar, acknowledging that informal expressions can vary from place to place.
Each language should have just one data-set, and it shouldn't contain words that are not part of the language (different symbols or scripts).
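To make the "different symbols or scripts" rule more concrete, here is a minimal sketch of how a sentence could be screened for letters from another script. This is only an illustration, not how Common Voice validates sentences: the function name `foreign_script_letters` and the idea of matching on Unicode character-name prefixes (for example `LATIN` or `CYRILLIC`) are assumptions made for this example.

```python
import unicodedata
from typing import List


def foreign_script_letters(sentence: str, script_prefix: str) -> List[str]:
    """Return the letters in `sentence` whose Unicode name does not start
    with `script_prefix` (hypothetical helper, for illustration only).

    Digits, punctuation, spaces and combining marks are ignored, because
    only alphabetic characters are inspected.
    """
    foreign = []
    for ch in sentence:
        if not ch.isalpha():
            continue
        # unicodedata.name returns e.g. "LATIN SMALL LETTER A";
        # the second argument is the default for unnamed characters.
        name = unicodedata.name(ch, "")
        if not name.startswith(script_prefix):
            foreign.append(ch)
    return foreign
```

For example, `foreign_script_letters("hello мир", "LATIN")` would flag the three Cyrillic letters, while a sentence written only in Latin letters returns an empty list.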
Accents
We consider an accent to be the combination of intonation (sound) and phonetics (spoken letters).
Accents usually come from places. For example, different cities where the same language is spoken can have different accents.
For Deep Speech, the more concrete, the better. Information about the location of an accent is very useful. The following list orders the information we are looking for from least useful to most useful:
No data < I don't know < Country < Region < City
We should seek the most concrete location for an accent, ideally the nearest city.
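The ordering above could be modelled roughly as follows. This is a sketch under assumptions, not an agreed data model: the names `AccentSpecificity` and `most_specific` are invented for this example, and the numeric values only encode the relative order from the list.

```python
from enum import IntEnum
from typing import List, Optional, Tuple


class AccentSpecificity(IntEnum):
    """Hypothetical levels, ordered from least to most useful."""
    NO_DATA = 0
    DONT_KNOW = 1
    COUNTRY = 2
    REGION = 3
    CITY = 4


Answer = Tuple[AccentSpecificity, Optional[str]]


def most_specific(answers: List[Answer]) -> Answer:
    """Pick the most concrete location a person provided.

    Falls back to NO_DATA when nothing was provided at all.
    """
    if not answers:
        return (AccentSpecificity.NO_DATA, None)
    # IntEnum members compare by value, so max() picks the highest level.
    return max(answers, key=lambda answer: answer[0])
```

With this, a person who gave both a country and a city would be recorded at city level, which is the most useful case for Deep Speech.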
When a language is not your native one, your accent can come from your native language or, if you are close to native, from a location.
We should allow people to self-identify where their accents are coming from.
Languages strategy proposal
Common Voice will only accept as languages those that follow the language definition explained above. We won't allow language variants to be considered different languages; that information will be captured as accents.
We will allow people to identify which languages they consider their native ones.
Accent strategy proposal
We will ask people for their accents in each language.
For languages they identified as native we will ask "Where does your accent come from? (select the nearest place)", options:
- I don't know
- Country list
- Regions list
- Cities list
We want to encourage people to select the most concrete location, preferably their nearest city (we would like to provide autocompletion from a list of known world cities so this is not free text).
For languages not identified as native we will ask "Where does your accent come from? (select the nearest place)", options:
- I don't know
- My native language
- Country list
- Regions list
- Cities list
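The two option lists differ only in the extra "My native language" entry for non-native languages. A minimal sketch of that logic, assuming a hypothetical helper called `accent_options` (the option labels are taken from the lists above; nothing here reflects an existing Common Voice API):

```python
from typing import List


def accent_options(is_native: bool) -> List[str]:
    """Build the accent question options for one language.

    `is_native` says whether the person identified the language as native.
    """
    options = ["I don't know"]
    if not is_native:
        # Non-native speakers can attribute their accent to their native language.
        options.append("My native language")
    options += ["Country list", "Regions list", "Cities list"]
    return options
```

Keeping a single function for both cases would make it harder for the native and non-native option lists to drift apart over time.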