Hello,
Right now there are only a few thousand sentences available for Esperanto, and we already have a lot of duplicates recorded. (See here why this is bad.)
I used the Common Voice Wiki Scraper to get more public-domain sentences in Esperanto.
Since the Esperanto community is small and the 1.2 million hour goal is not realistic for us, I decided to use very strict rules to get fewer sentences of higher quality. In my first run I got ~260 000 sentences, but after I applied some rules and added a blacklist, that boiled down to ~128 000 sentences, or 96 000 without repetitions. This is what I did:
- I excluded most letters that are not part of the Esperanto alphabet in order to exclude foreign words and phrases. I also excluded q, w, x and y, which are not part of the Esperanto alphabet; this helped a lot to get rid of all sorts of names and words in other languages. Here is the rule file I created for that.
- This file also includes some typical Esperanto abbreviations that I often found in the sentences, and some things I shamelessly stole from the rule files of other languages, like removing double spaces.
- I also excluded unusual letter combinations that are only used in foreign words, like sch, the, sh, cc,… This helped a lot to avoid German, English and Italian words, which are very common.
- I created a blacklist of uncommon words, most of which are not Esperanto. I chose to exclude words that are used fewer than 27 times. That's a much lower threshold than most other languages have chosen, but this wiki is smaller and the blacklist still contains more than one million words. EDIT: I later switched to >80 repetitions.
- I sorted everything alphabetically. This helped me to delete duplicates, and at the end of the list I found some Russian sentences that had somehow made it into the collection.
- After that I shuffled the sentences into random order again so that the list feels natural.
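For anyone who wants to try something similar on their own sentence list, the steps above can be roughly sketched in Python. This is only an illustration, not what the Common Voice Wiki Scraper or my rule file actually does: the set of allowed punctuation, the function names, and counting word frequencies over the input sentences themselves (instead of over the whole wiki) are all assumptions, and the frequency threshold is left as a parameter.

```python
import random
import re
from collections import Counter

# Lower-case Esperanto alphabet; q, w, x and y are deliberately absent.
ESPERANTO_LETTERS = set("abcĉdefgĝhĥijĵklmnoprsŝtuŭvz")
# Assumption: the punctuation allowed here is a guess, not the real rule file.
ALLOWED_CHARS = ESPERANTO_LETTERS | set(" .,?!-")
# Letter combinations named in the list above.
BANNED_COMBINATIONS = ("sch", "the", "sh", "cc")

WORD_RE = re.compile(r"[^\W\d_]+")  # runs of letters, including ĉ, ŝ, ŭ ...


def words(sentence):
    """Return the lower-cased words of a sentence."""
    return WORD_RE.findall(sentence.lower())


def has_only_allowed_chars(sentence):
    """True if every character is an Esperanto letter or allowed punctuation."""
    return all(ch in ALLOWED_CHARS for ch in sentence.lower())


def has_banned_combination(sentence):
    """True if the sentence contains one of the foreign-looking combinations."""
    lowered = sentence.lower()
    return any(combo in lowered for combo in BANNED_COMBINATIONS)


def build_blacklist(sentences, min_count):
    """Collect every word that occurs fewer than min_count times."""
    counts = Counter(w for s in sentences for w in words(s))
    return {w for w, n in counts.items() if n < min_count}


def clean(sentences, min_count, seed=0):
    """Filter, de-duplicate and shuffle a list of sentences."""
    blacklist = build_blacklist(sentences, min_count)
    kept = [
        s for s in sentences
        if has_only_allowed_chars(s)
        and not has_banned_combination(s)
        and not any(w in blacklist for w in words(s))
    ]
    unique = sorted(set(kept))  # sorting also makes strays easy to spot by eye
    random.Random(seed).shuffle(unique)  # fixed seed keeps runs reproducible
    return unique
```

Calling `clean(sentences, min_count=80)` on a list of strings would return the filtered, de-duplicated and shuffled list; with a small test list a much lower `min_count` is needed, otherwise almost every word ends up on the blacklist.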
The result is this list of sentences (6.8 MB): https://raw.githubusercontent.com/stefangrotz/common-voice-eo-vikipedio/master/wiki.eo-80.txt
This is the list without duplicates: https://raw.githubusercontent.com/stefangrotz/common-voice-eo-vikipedio/master/no-dublicates-80.txt
Here are 300 randomized sentences from the latest extraction: https://github.com/stefangrotz/common-voice-eo-vikipedio/blob/master/random300-github-review.txt
The error rate is pretty low, but there are still many non-Esperanto words in the sentences that I would like to avoid.
Where do we go from here? Do I have to enter them all into the Sentence Collector manually, or is there a better way?
Edit: updated files and numbers from my latest runs.