Okay, I just started a new run with a blacklist of all words that occur fewer than 80 times. This was a hard decision, because the list includes a lot of valid and interesting words, but also a lot of nonsense. There are still 28 000 words left to build sentences from, though, so I think this will work.
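For anyone who wants to try the same approach, here is a minimal sketch of how such a frequency-based blacklist could be built. It is not the extractor's own code. The function name `build_blacklist`, the `min_frequency` parameter and the tiny sample corpus are made up for illustration; only the threshold of 80 comes from the post.

```python
import re
from collections import Counter

def build_blacklist(sentences, min_frequency=80):
    """Return the set of words that occur fewer than min_frequency times."""
    counts = Counter()
    for sentence in sentences:
        # \w+ matches Unicode letters (and digits), so ĉ, ĝ, ĥ, ĵ, ŝ, ŭ stay inside words
        counts.update(re.findall(r"\w+", sentence.lower()))
    return {word for word, n in counts.items() if n < min_frequency}

# Toy corpus (hypothetical); a real run would count words over the whole dump
corpus = ["La hundo kuras.", "La kato dormas.", "La hundo dormas."]
blacklist = build_blacklist(corpus, min_frequency=2)
```

With the toy corpus and a threshold of 2, only the words that appear once end up on the blacklist. How the resulting list is handed to the script depends on its rules file and is not shown here.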
Since this thread is mostly about the script and technical questions, there are a few things I would like to know:
- I understand that you don’t accept pull requests with sentences from this script and that you have to run the script yourself, to avoid legal problems in case someone pulled too many sentences per article. But can I edit the result once the file is ready? I would only delete sentences and add nothing new. Some common errors are easier to delete by hand.
- For Esperanto it turned out to be extremely useful to exclude almost all letters that are not part of the Esperanto alphabet with disallowed_symbols. But this was a lot of work, and it slows everything down a lot. A whitelist of allowed symbols would be much more useful. Is this possible? I bet this could also be useful for some other languages.
- A lot of sentences are lost because sentences with abbreviations are ignored. It would be great if one could replace abbreviations with their written-out form.
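To make the last two requests more concrete, here is a rough sketch of what a symbol whitelist and an abbreviation replacement could look like. This is only an illustration, not an existing feature of the script. The names `ALLOWED`, `uses_only_allowed`, `ABBREVIATIONS` and `expand_abbreviations` are hypothetical, and the punctuation set and the small abbreviation table are assumptions.

```python
import re
import unicodedata

# The 28 letters of the Esperanto alphabet
ESPERANTO_LETTERS = unicodedata.normalize("NFC", "abcĉdefgĝhĥijĵklmnoprsŝtuŭvz")
# Allowed punctuation is an assumption; adjust to taste
ALLOWED = set(ESPERANTO_LETTERS + ESPERANTO_LETTERS.upper() + " .,!?'\"-")

def uses_only_allowed(sentence, allowed=ALLOWED):
    """True if every character of the sentence is on the whitelist."""
    # Normalize so that letter + combining accent compares equal to the single code point
    normalized = unicodedata.normalize("NFC", sentence)
    return all(ch in allowed for ch in normalized)

# Illustrative table only; a real list would need to be much longer
ABBREVIATIONS = {
    "ekz.": "ekzemple",
    "t.e.": "tio estas",
    "k.a.": "kaj aliaj",
}

def expand_abbreviations(sentence, table=ABBREVIATIONS):
    """Replace each known abbreviation with its written-out form."""
    for abbr, full in table.items():
        # (?<!\w) keeps the match from starting in the middle of a longer word
        sentence = re.sub(r"(?<!\w)" + re.escape(abbr), full, sentence)
    return sentence
```

A single set lookup per character should be cheaper than checking a long list of disallowed symbols one by one, which is the main reason a whitelist looks attractive here. The abbreviation replacement is case-sensitive and does not handle an abbreviation at the very end of a sentence, where its period also closes the sentence.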
Edit: Done, and I like the collection. To get rid of more foreign words, I also excluded letter combinations that are almost never used in Esperanto, for example sch, sh, the, cc, … This filtered out a lot more words. I now get 128 000 sentences, or 96 000 after deleting the duplicates. That would give us enough sentences for the next two or three years if we keep working at the same speed.
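As a sketch of these two clean-up steps (rejecting sentences with un-Esperanto letter combinations, then dropping duplicates), something like the following would do. It is a stand-alone illustration and not how the script itself applies the rules. The names `FOREIGN_PATTERN`, `looks_foreign` and `filter_and_dedupe` are hypothetical, and the pattern only contains the four combinations listed above.

```python
import re

# Letter combinations named in the post; the full list used in the run is longer
FOREIGN_PATTERN = re.compile(r"sch|sh|the|cc", re.IGNORECASE)

def looks_foreign(sentence):
    """True if the sentence contains one of the rejected letter combinations."""
    return FOREIGN_PATTERN.search(sentence) is not None

def filter_and_dedupe(sentences):
    """Drop foreign-looking sentences and exact duplicates, keeping the original order."""
    seen = set()
    kept = []
    for sentence in sentences:
        key = sentence.strip()
        if looks_foreign(key) or key in seen:
            continue
        seen.add(key)
        kept.append(key)
    return kept

sample = [
    "La hundo kuras.",
    "Schmidt venis hieraŭ.",
    "La hundo kuras.",
    "La kato dormas.",
]
cleaned = filter_and_dedupe(sample)
```

Only exact duplicates are removed here. Sentences that differ in punctuation or capitalization would still both be kept.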
I updated the linked files in the post above and changed some text to make things clearer.
Edit: Abbreviations instead of apprehensions 🤦