Many approaches are available for what the literature describes as "low-resource languages," including transfer learning from models trained on high-resource languages. Every bit of data helps, and even a tiny amount is much better than none. It's also worth reading about zero-resource learning, where there is no training on the target language at all until it's time to do recognition! In that challenging case you might start with just a target-language word list and a recognizer trained on another language (so not quite "zero"). As soon as you have even a small dataset with transcribed segments, as in Common Voice, you should be able to do much better.
I believe the Deep Speech model, unmodified, is simply one of the highest-performing architectures when you have lots and lots of data available, but there are hundreds of other architectures around.
In my opinion, what needs the most attention now is prompt design (i.e. the sentence collector). With big recurrent neural networks, I think it's really best to have very little repetition of prompts and of prompt wording. We also need to make sure we're getting speaker IDs right, so that model developers can strictly partition speakers into training and validation sets.
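To make the speaker-partitioning point concrete, here is a minimal sketch of a speaker-disjoint split (this is not any official Common Voice or Deep Speech tooling). It assumes a Common Voice-style TSV with `client_id` (speaker ID) and `path` columns; the file name, column names, and split fractions are placeholders for illustration.

```python
import csv
import random
from collections import defaultdict

def speaker_disjoint_split(tsv_path, train_frac=0.8, dev_frac=0.1, seed=0):
    """Assign whole speakers (not individual clips) to train/dev/test,
    so no speaker's voice appears in more than one partition."""
    clips_by_speaker = defaultdict(list)
    with open(tsv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            # Column names assume a Common Voice-style TSV layout.
            clips_by_speaker[row["client_id"]].append(row["path"])

    speakers = sorted(clips_by_speaker)
    random.Random(seed).shuffle(speakers)

    n_train = int(len(speakers) * train_frac)
    n_dev = int(len(speakers) * dev_frac)
    split_speakers = {
        "train": speakers[:n_train],
        "dev": speakers[n_train:n_train + n_dev],
        "test": speakers[n_train + n_dev:],
    }
    # Expand each partition back from speakers to their clips.
    return {name: [clip for spk in spks for clip in clips_by_speaker[spk]]
            for name, spks in split_speakers.items()}

if __name__ == "__main__":
    for name, clips in speaker_disjoint_split("validated.tsv").items():
        print(name, len(clips), "clips")
```

The key point is that the shuffle happens over speakers rather than clips, so every clip from a given `client_id` lands in exactly one partition; shuffling clips directly would let the model hear validation speakers during training and inflate its scores.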
" * Is a smaller dataset still useful for other work not related with Deep Speech?"
Very much so, if it is collected carefully, speaker metadata is recorded accurately, and so on. Here's an example of an interesting (copyrighted) dataset, collected from audio bibles in 700 different languages, which has been used to train the Festvox TTS (text-to-speech) system (I believe for all 700 languages): https://github.com/festvox/datasets-CMU_Wilderness There's no way you would've seen TTS systems for so many languages without the data.