The reason I think we can use CC-BY text:
The implication is that we end up with datasets under CC-BY, but as far as I understand, that license shouldn’t carry over to Deep Speech or any other AI trained on them.
I asked that question on Quora regarding machine translation and got a reasonable answer.
Is it legal to train neural networks to build machine translations based on copyrighted texts?
Yes. It makes no more sense for that to be illegal than it makes for it to be illegal for a human being to study how to translate texts by comparing copyrighted texts to translations of them.
Now, if you use that neural network to then translate a copyrighted text, the translation is subject to the original copyright, just as would be the case if you learned to translate the language yourself and then translated the text.
Training an AI is like teaching a child: I can use copyrighted materials to teach a kid a skill, and he or she is the sole owner of the fruits of that skill.