Stats about Common Voice: Kabyle Corpus

belkacem77 · March 11, 2019, 8:22pm

I’m sharing some stats about the kab corpus with the kab contributors on our FB page. The corpus was analyzed after tokenization and POS tagging with the NLTK Perceptron Tagger, using a model I had already trained on another corpus.

For graphs and networks I used matplotlib, numpy, networkx and pylab.

I analyzed:
Word length
Sentence length
Grammatical classes (tags)
Punctuation VS Alphabet
Verbs/Aspect
Verb occurrence
Word occurrence
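
As a rough illustration of how a few of these counts (word length, sentence length, punctuation vs. alphabet, word occurrence) could be collected, here is a minimal sketch using only the Python standard library. It is not the script used for the stats above: the regex tokenizer, the `corpus_stats` function and the sample strings are all assumptions made for the example, and a real Kabyle tokenizer and the POS-based counts (tags, verbs/aspect) are left out.

```python
import re
from collections import Counter

# Hypothetical tokenizer (an assumption, not the Kabyle tokenizer from the post):
# a "word" is a run of word characters, optionally joined by hyphens or
# apostrophes; any other non-space character is one "punct" token.
TOKEN_RE = re.compile(r"(?P<word>\w+(?:[-']\w+)*)|(?P<punct>[^\w\s])")


def corpus_stats(sentences):
    """Count word lengths, sentence lengths, word occurrences and token kinds."""
    stats = {
        "word_length": Counter(),      # word length in characters -> number of words
        "sentence_length": Counter(),  # words per sentence -> number of sentences
        "word_count": Counter(),       # lowercased word -> occurrences
        "token_kind": Counter(),       # "word" / "punct" -> number of tokens
    }
    for sentence in sentences:
        n_words = 0
        for match in TOKEN_RE.finditer(sentence):
            kind = match.lastgroup  # name of the alternative that matched
            stats["token_kind"][kind] += 1
            if kind == "word":
                word = match.group().lower()
                stats["word_length"][len(word)] += 1
                stats["word_count"][word] += 1
                n_words += 1
        stats["sentence_length"][n_words] += 1
    return stats


if __name__ == "__main__":
    # Placeholder strings for illustration only, not taken from the corpus.
    sample = ["Azul fell-awen!", "Azul, tanemmirt."]
    result = corpus_stats(sample)
    print(result["word_count"].most_common(3))
    print(dict(result["token_kind"]))
```

The resulting `Counter` objects can be passed to matplotlib (for example as bar charts of `word_length` or `sentence_length`) to produce graphs of the kind mentioned above.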

We use these stats to avoid repetitive words and syntactic forms.
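
One way such counts could be turned into a "too repetitive" warning is sketched below. This is only an assumption about how the check might look: the `overused_words` function and its `max_share` and `min_count` thresholds are hypothetical and would need tuning for a real corpus.

```python
from collections import Counter


def overused_words(word_counts, max_share=0.01, min_count=2):
    """Return (word, share) pairs for words whose share of all word tokens
    exceeds max_share, most frequent first.

    max_share and min_count are hypothetical thresholds chosen for the example.
    """
    total = sum(word_counts.values())
    if total == 0:
        return []
    return [
        (word, count / total)
        for word, count in Counter(word_counts).most_common()
        if count >= min_count and count / total > max_share
    ]


if __name__ == "__main__":
    # Made-up counts for illustration only, not real corpus figures.
    counts = Counter({"azul": 6, "tanemmirt": 3, "amek": 1})
    for word, share in overused_words(counts, max_share=0.25):
        print(f"{word}: {share:.0%} of all word tokens")
```

The same idea can be applied to tag sequences instead of words to spot over-used syntactic patterns.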

[Image: Figure 1]

nukeador · February 7, 2019, 12:25pm

Cool! Is this something that can be re-used for other languages? Are the tools to generate this openly available?

Cheers.

belkacem77 · February 7, 2019, 1:04pm

Yes, but the scripts only handle the Kabyle language (tokenization, POS tagging…). They have been freely available on GitHub/GitLab for months or years :smile: I’ll check whether I uploaded the latest updates (mozillakab on GitHub; I mostly use French to explain/describe things).