Hi! To be more precise, I am a doctoral student working on NLP (on Dutch in particular) at Ghent University (UGent).
One of my projects is to build a language model for (Flemish) Dutch, so over the past months I have collected a lot of resources. They come from various origins, but all are suitable for research; exact licensing is usually possible to track down, though not always readily available for every source. I can therefore use these sources to train language models.
I don’t have a language model right now that is tailored to generating Common Voice-style sentences, but I have enough data and tooling to build one if that would be useful.
Running that language model to generate new sentences shouldn’t be a problem (a rough sketch is below), but of course these sentences would then need to be manually reviewed for language accuracy. Still, I would expect the kind of short sentences Common Voice needs to be almost all grammatically correct, since longer sentences have a bigger chance of being incorrect than short ones.
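To give an idea of what I have in mind, here is a rough sketch of the sampling loop using Hugging Face transformers. `my-dutch-gpt2` is just a placeholder for whatever model I would end up training, and the length cap is an arbitrary choice on my part:

```python
# Sketch: sample short candidate sentences from a causal LM,
# keeping only the short ones, which are more likely to be correct.
# "my-dutch-gpt2" is a placeholder model name.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("my-dutch-gpt2")
model = AutoModelForCausalLM.from_pretrained("my-dutch-gpt2")

def sample_candidates(n=50, max_words=14):
    # Start every sample from the BOS token, i.e. unconditional
    # generation (assumes the tokenizer defines a BOS token).
    input_ids = torch.tensor([[tok.bos_token_id]])
    out = model.generate(
        input_ids,
        do_sample=True,
        top_p=0.9,                      # nucleus sampling for variety
        max_new_tokens=30,              # keep generations short
        num_return_sequences=n,
        pad_token_id=tok.eos_token_id,
    )
    kept = []
    for ids in out:
        text = tok.decode(ids, skip_special_tokens=True)
        # Take only the first sentence of each sample.
        sentence = text.split(".")[0].strip() + "."
        if 0 < len(sentence.split()) <= max_words:
            kept.append(sentence)
    return kept  # candidates still go to a human reviewer
```

Everything this produces would still go through human review; the length filter is only there because short outputs tend to need fewer corrections.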
That said, I would be interested to know which licenses (for the training text) would allow me to train a language model and then contribute the generated sentences under CC0. I would naively assume anything that can be used for research purposes is fair game as long as the generated sentences do not match the training set (an easy thing to check mechanically, see the sketch below), but this is a question we really need a legal team to look at. Technically, Wikipedia is under a Share-Alike license, but would sentences generated by a language model trained on Wikipedia have to be Share-Alike too? The model itself probably would if it were released, but would randomly generated sentences? That’s my question, I guess.
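The “does not match the training set” part is at least easy to automate for exact matches; something like this rough sketch (the corpus filename is hypothetical):

```python
# Sketch of an exact-match filter: drop any generated sentence that
# appears verbatim in the training corpus (after normalisation).
# "training_corpus.txt" is a hypothetical one-sentence-per-line file.
def normalise(s: str) -> str:
    # Lowercase and collapse whitespace so trivial variants still match.
    return " ".join(s.lower().split())

with open("training_corpus.txt", encoding="utf-8") as f:
    training_sentences = {normalise(line) for line in f}

def not_in_training_set(candidates):
    return [s for s in candidates
            if normalise(s) not in training_sentences]
```

Of course this only catches verbatim reproduction; near-duplicates would need fuzzier matching (n-gram overlap, for instance), and whether an exact-match filter is enough from a licensing standpoint is precisely the question for the lawyers.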
Food for thought: OpenAI’s GPT-2 is released under the MIT license, despite being trained on data randomly sampled from articles linked on Reddit, and Google’s BERT is under Apache 2.0. (It is my understanding that BERT was trained on a derivative of Wikipedia, but I could be wrong.)