Hello. Thank you for an amazingly simple implementation of a wonderful idea. Rather than randomly including writing, you might consider using some already qualified public domain resources. There is a list of such resources here: https://en.wikipedia.org/wiki/Wikipedia:Public_domain_resources
Of course, while you could consider any resource from that page, I would ask that you specifically consider including healthcare text that uses simple medical terminology. Not the kind of things that doctors say about healthcare (that is too technical and specific), but the types of things that patients might like to read and/or discuss in everyday terms.
An amazing resource for this that is written in lay terms is the MedlinePlus website. Not everything on MedlinePlus is public domain, but they specify what is covered and what is not here:
https://medlineplus.gov/copyright.html
Note that the Medline encyclopedia is licensed content and is therefore not public domain. But the Health Topics are public domain:
As are the FAQ answers
https://medlineplus.gov/faq/disease.html
And the MedlinePlus magazine
Obviously, as a healthcare data journalist I have an axe to grind here, but there are a huge number of English sentences here that have nothing specifically medical about them. For instance, the sentence “The people who write the materials are the ones who decide if they are easy to read.” is found on one of the FAQ pages. Moreover, while the terms in Medline are intended to be “layman’s terms”, they include words like “Alzheimer”, which are common enough but will likely show huge differences in pronunciation.
I should note that the sections on women’s health topics in Medline are likely to include more sentences including female pronouns.
Given that you are interested in resources that are not medical, I would also suggest the Federal Register, which is also without copyright.
example text:
https://www.gpo.gov/fdsys/pkg/FR-2015-01-02/html/2014-30754.htm
It should be relatively simple to run a script which removes all sentences that include the gobbledygook internal reference system, and also those with acronyms (NIST, NASA, etc.). Once that is done, this would be a huge corpus of relatively simple English sentences. If you wanted to ensure that the sentences were even closer to common language, you might simply exclude everything except the contents of the executive summaries of the articles, which are intended to be relatively jargon-free.
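To make the filtering idea concrete, here is a rough sketch of what such a script could look like in Python. The citation patterns (things like "80 FR 1234" or "42 CFR 410.2"), the "two or more capital letters" rule for acronyms, and the naive sentence splitter are all my own assumptions, not anything the Federal Register specifies, so they would need tuning against real documents.

```python
import re

# Assumed patterns for Federal Register style internal references.
# These are guesses for illustration; tune them against real documents.
REFERENCE_PATTERNS = [
    r"\b\d+\s+(?:FR|CFR)\b",  # e.g. "80 FR 1234", "42 CFR 410.2"
    r"\bU\.S\.C\.",           # e.g. "42 U.S.C. 1395"
    r"§",                     # section symbol
    r"\bDocket\s+No\.",       # docket identifiers
]
REFERENCE_RE = re.compile("|".join(REFERENCE_PATTERNS))

# Assumption: an acronym is any standalone run of two or more capitals.
ACRONYM_RE = re.compile(r"\b[A-Z]{2,}\b")

# Naive splitter: break after ., ! or ? when followed by whitespace
# and a capital letter. A real pipeline would want something sturdier.
SENTENCE_SPLIT_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_sentences(text):
    """Split text into rough sentences, dropping empty pieces."""
    return [s.strip() for s in SENTENCE_SPLIT_RE.split(text) if s.strip()]


def is_plain_sentence(sentence):
    """True if the sentence has no assumed reference pattern or acronym."""
    return not (REFERENCE_RE.search(sentence) or ACRONYM_RE.search(sentence))


def filter_sentences(text):
    """Keep only the sentences that pass is_plain_sentence."""
    return [s for s in split_sentences(text) if is_plain_sentence(s)]


if __name__ == "__main__":
    sample = (
        "The agency will accept comments for sixty days. "
        "NASA published the notice under 80 FR 1234. "
        "Comments may be sent by mail. "
        "See 42 CFR 410.2 for details."
    )
    for sentence in filter_sentences(sample):
        print(sentence)
```

The filter is deliberately aggressive: it throws away any sentence that looks suspicious rather than trying to repair it, since the corpus is large enough that losing sentences should not matter much.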
If that is still not enough, you should consider including the text of comments made to various regulations on regulations.gov. Most people are unaware that the comments that they make on regulations themselves become public domain. See here: https://www.regulations.gov/userNotice
This data is available via an API, and here is an example:
https://www.regulations.gov/document?D=VA-2016-VHA-0011-184061
HTH,
-FT