How to deal with academic and public domain licenses for model usage

As the title says, we (for Italian) have a problem.
Right now the majority of datasets come from the academic world; they don’t have any license, but they require citing the paper.
So for the Italian model https://github.com/MozillaItalia/DeepSpeech-Italian-Model/ we are avoiding them because we don’t know how to deal with them.

The post at https://hacks.mozilla.org/2019/12/deepspeech-0-6-mozillas-speech-to-text-engine/ mentions two academic datasets that have this issue: no license, but a citation is required.

So my question is: can we use them and release a public domain model? Or do we need to mention that we are using them, and do the users of the model need to mention them as well?
We have the same problem for audio+text and text-only datasets, and also when using CC-licensed data (including non-commercial) to generate a model.

I also started a discussion in Italian on Reddit https://www.reddit.com/r/ItalyInformatica/comments/e6ffyg/licenze_open_source_e_paper_accademici/ to better understand the problem.

If we can use this material and license the model as public domain, even when the resources used to generate it come from different sources with different licenses, that will change our project, because we will no longer have any limits.
The point we raised is whether using material licensed one way and releasing something that processes that material (or maybe just a part of it) creates an issue for the whole project.

Probably Mozilla, with its legal team, can help us understand this, including the issue that every country has different regulations…

Can you clarify what those two datasets you’re referring to are?

Fisher https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2004-fisher-corpus.pdf and Switchboard https://catalog.ldc.upenn.edu/LDC97S62 (this one has a license that I don’t think is open source)

Those datasets are proprietary, but they do have licenses, which we paid for.

So I am wondering: in the case of academic datasets that only require a citation, can we use them to release a public domain model, or do we need to mention them? After all, we are using that data to generate something else.

I would agree in principle, but I think we would have to get that verified by lawyers.

Do you have some datasets already identified?

An example in my case is http://www.mspkacorpus.it/; we have already written to their email addresses, with no answer in over 10 days.
But it is just an example of how those datasets are released: no license, just a citation to include.

If there is no license, the default is copyright, unfortunately. I was not involved in the negotiations for the datasets we paid for, so I’m unsure how that plays here.