[Technical feedback needed] Wikipedia extractor script beta

Hello,
Right now there are only a few thousand sentences available for Esperanto, and we already have a lot of duplicates recorded. (See here for why this is bad.)
I used the Common Voice Wiki Scraper to get more public-domain sentences in Esperanto.

Since the Esperanto community is small and the 1.2-million-hour aim is not realistic for us, I decided to use very strict rules to get fewer sentences of higher quality. In my first run I got ~260 000 sentences, but after I applied some rules and added a blacklist it boiled down to ~128 000 sentences, or 96 000 after removing duplicates. This is what I did:

  • I excluded most letters that are not part of the Esperanto alphabet to filter out foreign words and phrases. I also excluded q, w, x and y, which are not part of the Esperanto alphabet; this helped a lot to get rid of all sorts of names and words in other languages. Here is the rule file I created for that.
  • This file also includes some typical Esperanto abbreviations that I often found in the sentences, and some things I shamelessly stole from other languages, like the deletion of double spaces.
  • I also excluded unusual letter combinations that are only used in foreign words, like sch, the, sh, cc, … This helped a lot to avoid German, English and Italian words, which are very common.
  • I created a blacklist of uncommon words, most of which are not Esperanto. I chose to exclude words that are used fewer than 27 times. That's much less than most other languages have chosen, but this wiki is smaller, and the blacklist still contains more than one million words. EDIT: I later switched to a threshold of 80 repetitions.
  • I sorted everything alphabetically. This helped me to delete duplicates, and at the end of the list I found some Russian sentences that had somehow made it into the collection.
  • After that I shuffled the sentences back into random order so that the list feels natural again (see the shell sketch after this list).
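
For anyone who wants to reproduce the last two steps outside the scraper, standard shell tools are enough. A minimal sketch (the file names are placeholders, and the grep line only shows the q/w/x/y rule, not the full letter exclusion):

# Drop sentences containing q, w, x or y (not part of the Esperanto alphabet).
grep -vi '[qwxy]' wiki.eo.txt > filtered.txt
# Sort alphabetically, delete exact duplicates, then shuffle back into random order.
sort -u filtered.txt | shuf > shuffled.txt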

The result is this list of sentences (6.8 MB): https://raw.githubusercontent.com/stefangrotz/common-voice-eo-vikipedio/master/wiki.eo-80.txt
This is the list without duplicates: https://raw.githubusercontent.com/stefangrotz/common-voice-eo-vikipedio/master/no-dublicates-80.txt
Here are 300 randomized sentences from the latest extraction: https://github.com/stefangrotz/common-voice-eo-vikipedio/blob/master/random300-github-review.txt

The error rate is pretty low, but there are still many non-Esperanto words in the sentences that I would like to avoid.

Where do we go from here? Do I have to put them all into the Sentence Collector manually, or is there a better way?

Edit: updated files and numbers from my latest runs.

@stergro I see you have created a pull request; we’ll just need a few details about the output:

Hey @nukeador, thanks for the quick reaction. I closed my pull request, and there will be another one in a few days.

I worked on the rules, added more abbreviations and, most importantly, did an analysis of the frequency of the letters in my last sentence list. This helped me a lot to collect a big number of letters that are now also excluded in the rule file. This slows everything down but helps a lot with the quality. There are still a few foreign words in some sentences, but they are all at least written in letters that exist in the Esperanto alphabet.
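
In case anyone wants to reproduce the letter-frequency analysis: with GNU grep in a UTF-8 locale, a pipeline along these lines counts how often each character occurs (the file name is a placeholder):

# Split the file into one character per line, then count and rank them.
grep -o . wiki.eo.txt | sort | uniq -c | sort -rn | head -50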

Hey my fellow Esperantists @tirifto @Pablo_Busto @Mte90 @nicolaruggiero1986, would you like to help me estimate the error rate for these new sentences from the Wikipedia in Esperanto? I created a file with 300 random sentences out of the 96 000. I would guess that we have an error rate of around 5/100; what do you think? It is enough if you only look at the first 100 or 200 sentences, but it is important that you give me a number, because we can only get the file into Common Voice if we have an estimated error rate.

Esperanto is simpler here than other languages, because its alphabet has specific letters, so I don’t think that will be a problem with the extractor.
Compared to Italian or Spanish, where we need to exclude for example Greek or German letters, Esperanto is simpler because it rewrites foreign words using its own alphabet.

That’s not completely true. I excluded all letters that are not part of the Esperanto alphabet with the script, but only a fraction of the articles transcribe everything into the Esperanto alphabet. Since Wikipedia is an encyclopedia, there are still a lot of words in the texts that are not transcribed into Esperanto. One example:

Ĝi ŝuldas sian nomon al itala urbo Lecce, ĉefurbo de la provinco Lecce. [It owes its name to the Italian city of Lecce, capital of the Lecce province.]

But the error rate is more about general errors like truncated sentences, grammar errors, typos and so on.

Thanks to the Catalan and Esperanto communities’ feedback, I’ve updated the repo README to clarify the expectations on how to get rules into the repo and how to get sentences extracted and incorporated into CV.

I think that is acceptable as an error rate; starting to focus on wrong transcriptions for Esperanto could be a pain and a very big task.
Grammar errors are likewise something that is close to impossible to fix without working on Wikipedia itself, which was chosen precisely because it makes such issues easier to avoid.
It would probably be simpler for Esperanto to take a dictionary and check whether all the words of a sentence are in it (see the sketch below), but that would be very expensive for the tool to do.
As for truncated sentences, that again is a problem of the scraper and not of the language, and needs to be reported.
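
For what it’s worth, such a dictionary check is cheap as a post-processing step with standard shell utilities. A rough sketch, assuming a hypothetical one-word-per-line dictionary.txt; it lists the words from the extraction that are not in the dictionary:

# All unique words that actually occur in the extracted sentences.
grep -oE '[[:alpha:]]+' wiki.eo.txt | sort -u > words-used.txt
# Words used in the sentences but missing from the dictionary.
sort -u dictionary.txt | comm -23 words-used.txt -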

I will see if there are “disallowed” words for Italian or something like that.

I already created a blacklist of over a million disallowed words based on repetition. I chose words that appear fewer than 27 times in all the texts; maybe that threshold was too low. Other languages chose to avoid words that appear fewer than 80 times, but for us this would blacklist many valid words. I will create another blacklist with more words this evening.
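
For reference, a frequency-based blacklist like this can be rebuilt with a short shell pipeline. A sketch; the file names are placeholders, and the scraper may ship its own tooling for this:

# Count how often each word occurs in the full extraction,
# then keep only the words below the chosen threshold.
grep -oE '[[:alpha:]]+' wiki.eo.all.txt | sort | uniq -c | awk '$1 < 80 { print $2 }' > blacklist.txt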

Okay, I just started a new run with a blacklist of words that occur fewer than 80 times. This was hard, because that list includes a lot of valid and interesting words, but also a lot of nonsense. Still, there are 28 000 words left to build sentences from, so I think this will work.

Since this thread is mostly about the script and technical questions, there are a few things I would like to know:

  • I understand that you don’t accept pull requests with sentences from this script, and that you have to run the script yourselves to avoid the legal problems that would arise if someone pulled too many sentences per article. But can I edit the result once the file is ready? I would only delete sentences and add nothing new. Some common errors are easier to delete by hand.
  • For Esperanto it turned out to be extremely useful to exclude almost all letters that are not part of the Esperanto alphabet with disallowed_symbols. But this was a lot of work and slows everything down a lot. It would be much more useful to have a whitelist of allowed characters (see the workaround sketched after this list). Is this possible? I bet this could also be useful for some other languages.
  • A lot of sentences are lost because of ignored abbreviations. It would be great if one could replace abbreviations with their fully written-out form.
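
Until the scraper supports a whitelist natively, it can be approximated as a post-processing step: keep only the lines made up entirely of allowed characters. A sketch for Esperanto with GNU grep in a UTF-8 locale (the character class is illustrative, not exhaustive; the A-PR-VZ and a-pr-vz ranges skip q, w, x and y):

# Keep only lines consisting of Esperanto letters, digits, punctuation and whitespace.
grep -E '^[A-PR-VZa-pr-vzĈĉĜĝĤĥĴĵŜŝŬŭ0-9[:punct:][:space:]]+$' wiki.eo.txt > whitelisted.txt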

Edit: Done, and I like the collection. To get rid of more foreign words I also excluded letter combinations that are almost never used in Esperanto, for example sch, sh, the, cc, … This filtered out a lot more words. I now get 128 000 sentences, and after deleting the duplicates 96 000. This would give us enough sentences for the next two or three years if we keep working at the same speed.

I updated the linked files in the post above and changed some text to make things clearer.

Yes, only deletion PRs on sentences are accepted, to remove wrong or bad sentences.

This might be something to consider if it brings more quality to some languages. Can you please open a GitHub issue so we can check the best approach with the devs? Thanks.

This would probably fall into the new features category. Can you please also open a GitHub issue about it so it doesn’t get lost? Thanks.

Okay, I opened an issue about the whitelist here: #50. And there is already an old issue about converting abbreviations: #9.

1 Like
  • I couldn’t install “cargo”, so another Basque contributor made the extraction for me.
  • Basque Wikipedia’s result is available here (25MB): https://ikusimakusi.eus/bitartekoak/wiki.eu.txt
  • It contains about 399 474 rows, 200 362 of them repeated. So the unique sentences in the file (after a sort -u) amount to 49.85%.
  • I checked the first 100 sentences and 90 are right. 10 are wrongly cut, apparently for the same reason: they all start with “mendean” [in the century], probably because the scraper cuts sentences like “XX. mendean…” [in the 20th century…] into “XX.” and “mendean…”.
    • Basque, like other languages, uses Roman numerals to express centuries. But surely the problem isn’t that; it is that Basque uses the dot character to separate sentences but also to express ordinals: “1. atera” [To the 1st gate], “2. atean” [On the 2nd gate], “XIII. mendea” [13th century], etc.
      • As we don’t want acronyms or digits in Common Voice’s sentences, I think this cutting problem (at least for Basque) could be fixed by skipping digits ([0-9]) and contiguous upper-case letters ([A-Z][A-Z]); see the sketch below. This wouldn’t fix Roman ordinals with just one character like “I”, “V” and “X”, but it would help. It’s just an idea…
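
As a quick way to test that idea on the extractor output, the file can be filtered afterwards. This drops every line containing a digit or two adjacent upper-case letters, exactly the rule sketched above (it still misses one-letter Roman numerals):

# Remove lines with digits or contiguous upper-case letters.
grep -Ev '[0-9]|[A-Z][A-Z]' wiki.eu.txt > wiki.eu.filtered.txt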

These are great advances @txopi :smiley:

You might be able to play with the rules files and the blacklist to avoid Roman ordinals. Other people in this topic would be able to help with the regex.

Once you have a set of rules and a blacklist that produce an output rated at a <7% error rate by 2-3 native speakers, feel free to open a PR adding the following information:

  • How many sentences are you getting?
  • How did you create the blacklist? (Specify the criteria, e.g. words with <80 repetitions.)
  • Get 2-3 additional native speakers (ideally some linguists) to comment here with the estimated error rate. You can share with them a few samples of 500 random sentences from your output (see below for a quick way to generate those).
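
For the random samples, shuf does the job; for example (with output.txt standing in for your extraction):

shuf -n 500 output.txt > sample-500.txt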

Cheers.

Hi, I’m trying to collect sentences for Russian, and in the step:
cargo run -- extract -l russian -d ../wikiextractor/text/ >> wiki.ru.txt
there is an error:
Compiling punkt v1.0.5
error[E0554]: #![feature] may not be used on the stable release channel
--> /home/user/.cargo/registry/src/github.com-1ecc6299db9ec823/punkt-1.0.5/src/lib.rs:141:1
|
141 | #![feature(proc_macro_hygiene)]
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

error: aborting due to previous error

For more information about this error, try rustc --explain E0554.
error: Could not compile punkt.

To learn more, run the command again with --verbose.

Could you help with this?

@txopi awesome!! Wiki links! I will check them out.

You will need to run Rust nightly as commented in the README:
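
Assuming rustup manages your toolchain, something like this should get past the E0554 error (the exact commands in the README may differ):

# Install the nightly toolchain, then run the extraction with it.
rustup toolchain install nightly
cargo +nightly run -- extract -l russian -d ../wikiextractor/text/ >> wiki.ru.txt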

Feel free to open GitHub issues if you are still facing any problems.

Thanks!

We keep working on this. The Basque language has a lot of declensions (suffixes), so the automatic blacklist has lots and lots of correct words. We have been cleaning it (just partially) because we didn’t want to reject so many correct sentences from the Basque Wikipedia. :dizzy_face:
We finally achieved an acceptable result, so we can continue with the process! More info soon. :slightly_smiling_face:

After some installation hassle, I was finally able to run the script for Frisian. I lowered the threshold for generating the blacklist to 50. I still see some very useful words in that list, so I’m going to lower it again. As a test I ran the scraper, which ended up with 22 500 sentences. As mentioned above, a number of sentences are incomplete due to abbreviations. A list of known abbreviations with their full expansions would result in more sentences.
I’m going to experiment some more with the settings and take a closer look at the results.

Did some more experiments and finally ended up with almost 60k sentences. These have to be enhanced with a solution for abbreviations; otherwise there are too many truncated sentences. So next to the rules in <language>.toml and the blacklist (best result with the frequency limit set to 1 for Frisian), there has to be a file of known abbreviations, with something like the mappings below (a sed sketch follows them):
e.g. = for example
i.e. = in other words
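
Until the scraper supports such a mapping file natively (see the abbreviation issue mentioned earlier in this thread), the expansion can be done beforehand with GNU sed. A sketch using the two examples above, with sentences.txt as a placeholder:

# Expand known abbreviations to their full written form before extraction.
sed -e 's/\be\.g\./for example/g' -e 's/\bi\.e\./in other words/g' sentences.txt > expanded.txt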

Next up: Dutch (which will probably take more time to generate :slight_smile:).