Adyghe: multiple dialects in a single dataset

daniel.abzakh · October 30, 2019, 8:12am

@nukeador
In the Adyghe language, the western and eastern dialects differ. Should they be separated? Would that negatively affect DeepSpeech, or can DeepSpeech be taught multiple dialects from a single dataset?
My take on this:
1- I would rather keep them together in order to concentrate efforts in one direction.
2- I want DeepSpeech to be able to recognize the language across multiple dialects, without needing to be told which dialect to listen for.

Let’s assume the answer is “No, they should be separated”. Can we still collect them in a single dataset and flag each sentence with its associated dialect? That would make it easy to separate them later if DeepSpeech needs it. It would also let us add even more dialects, because the eastern and western dialects have their own inner dialects as well.
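To make the flagging idea concrete, here is a minimal sketch of what I have in mind. It assumes each sentence is stored with an ISO 639-3 style dialect tag; the record layout and the `split_by_dialect` helper are hypothetical and not part of Common Voice or DeepSpeech.

```python
from collections import defaultdict

# Hypothetical records: every sentence carries a dialect flag.
# The sample words are the ones discussed in this thread.
sentences = [
    {"text": "Нэф", "dialect": "ady"},
    {"text": "Нэшъу", "dialect": "ady"},
    {"text": "Нэху", "dialect": "kbd"},
    {"text": "Нэф", "dialect": "kbd"},
]


def split_by_dialect(records):
    """Group sentence texts by their dialect flag, preserving input order."""
    groups = defaultdict(list)
    for record in records:
        groups[record["dialect"]].append(record["text"])
    return dict(groups)


corpora = split_by_dialect(sentences)
print(corpora["ady"])
print(corpora["kbd"])
```

With a flag like this, the data could stay in one place and still be split into per-dialect corpora whenever that turns out to be necessary.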

Sincerely,
Daniel.

nukeador · October 30, 2019, 12:09pm

Hi,

What’s your definition of dialect here?

We are still finalizing our languages and accents strategy, but right now, for the text corpus, we treat as a single dataset-language a common writing system that shares the same words, grammar and script, acknowledging that informal expressions can vary between territories.

daniel.abzakh · October 30, 2019, 12:38pm

By dialect I mean that two people can communicate and intelligibly understand each other for the most part, even though the two dialects differ in grammar, writing and vocabulary.

nukeador · October 30, 2019, 12:56pm

The text corpus determines the Common Voice dataset. If these two can’t share a unified text corpus due to different grammar and words, we should consider them different dataset-languages.

In any case, we should consult linguists who work on these languages to determine whether the differences are significant enough to justify separate text corpora.

apequab · November 3, 2019, 10:15am

Hi Daniel,
Interesting topic. I think that, in the beginning, it would be good to keep both dialects separate, so that we have two databases: one with text and recorded audio for the western dialect, and one with the same for the eastern dialect.

Then, once we have enough data for both dialects, it would be good to compare them and see the differences, and later combine them so that DeepSpeech would know which dialect it is.

For example, in the Cyrillic script currently used, there are some grammar and writing differences between the two Circassian dialects:

“Нэф” = 1. “Light, Light color” - in western dialect “ady”
= 2. “Blind” - in eastern dialect “kbd”

“Нэшъу” = 1. “Blind” - in western dialect “ady”
= 2. pronunciation and writing of “шъу” is not used in the eastern dialect “kbd”

“Нэху” = 1. “Light, Light color” - in eastern dialect “kbd”
= 2. pronunciation and writing of “ху” is not used in the western dialect “ady”

To summarize:
“Light, Light color” = 1. Нэф = western dialect “ady”
= 2. Нэху = eastern dialect “kbd”
“Blind” = 1. Нэшъу = western dialect “ady”
= 2. Нэф = eastern dialect “kbd”
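The summary above can also be written as a small lookup table. This is only an illustrative sketch: the `MEANINGS` table and the helper functions are hypothetical, and the entries are limited to the words listed in this post.

```python
# (written form, dialect code) -> meaning, taken from the examples above.
MEANINGS = {
    ("Нэф", "ady"): "light, light color",
    ("Нэф", "kbd"): "blind",
    ("Нэшъу", "ady"): "blind",
    ("Нэху", "kbd"): "light, light color",
}


def meanings_of(word):
    """Return {dialect: meaning} for every dialect that uses this written form."""
    return {
        dialect: meaning
        for (form, dialect), meaning in MEANINGS.items()
        if form == word
    }


def is_ambiguous(word):
    """True if the same written form has different meanings across dialects."""
    return len(set(meanings_of(word).values())) > 1


print(meanings_of("Нэф"))
print(is_ambiguous("Нэф"), is_ambiguous("Нэху"))
```

A written form like “Нэф” that means different things in the two dialects is the kind of case where a sentence in a shared corpus would need a dialect label to be read correctly.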

daniel.abzakh · November 4, 2019, 11:13am

@apequab You are probably right.

I will try to get in touch with someone to get some advice.

daniel.abzakh · November 7, 2019, 9:21am

@apequab
I asked a philologist about this situation. She mentioned that although the two dialects have some differences, they use the same Cyrillic writing system.
Nonetheless, she recommended separating them. In that case, I think they can be separated as follows:
1- Adyghe - ady [ISO 639-3]
2- Adyghe (Kabardian) - kbd [ISO 639-3]

Any thoughts on this?

nukeador · November 7, 2019, 10:21am

Are they using the same written vocabulary and grammar? (minor local expressions are fine)

Can speakers read the same text corpus without problems?

apequab · November 7, 2019, 10:56am

Yes, true. Although they share one Cyrillic writing system, there are some minor differences in writing and pronunciation between the “ady” and “kbd” dialects of the Circassian language.

I think that, in the beginning, it would be helpful to keep both dialects separate, so we can have both recorded and validated. Later, we can better compare them and see the differences, and after that we can combine both so DeepSpeech can identify them when spoken or written.

apequab · November 7, 2019, 11:08am

At the moment, the written vocabularies and grammar are not the same, because some words are written and pronounced differently.

For example, speakers of the east Circassian dialect “kbd” cannot understand, read or write the western dialect “ady” very well. But for speakers of the west Circassian dialect “ady”, it’s easier to understand, read and write the eastern dialect “kbd”.

Currently, it is a matter of knowledge; personally, I understand, read and write both dialects.

daniel.abzakh · November 7, 2019, 11:44am

@nukeador @apequab
Having one corpus might create issues for eastern Circassians when recording and validating.
It looks like it would be better if they were separated.

apequab · November 7, 2019, 11:54am

Yes, agree.
In this case, the Adyghe (Kabardian) “kbd” [ISO 639-3] has to be enabled on Common Voice and Pontoon as well.

daniel.abzakh · November 11, 2019, 8:26am

@nukeador @pmo
Can we add the following language? (@apequab can be the contributor)
Code : kbd

Name : Adyghe (Kabardian)

Script : Cyrillic

Plural rule : Adyghe

CLDR plural : Adyghe

Direction : left-to-right

Population : wikipedia - Kabardian language

Regards,
Daniel.