I downloaded and tried the pretrained model 0.5.1 and the results were not so good with audio files at 16 kHz. What needs to be done to get state-of-the-art results? Thanks
The pre-trained model is a work in progress, trained with a fraction of the data used in a production model; you will either need to wait for the model to improve or augment it with your own data.
Please give more context on what you are doing.
I recorded a file using Amazon Polly in Audacity at a 16 kHz sample rate.
```
$ soxi joana.wav

Input File     : 'joana.wav'
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Duration       : 00:00:08.31 = 132912 samples ~ 623.025 CDDA sectors
File Size      : 266k
Bit Rate       : 256k
Sample Encoding: 16-bit Signed Integer PCM
```
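The soxi output above already confirms the format, but the same check can be done from Python with only the standard library. This is a minimal sketch (the `check_wav` helper is my own, not part of DeepSpeech) that verifies a WAV file matches what the 0.5.1 model expects: 16 kHz, mono, 16-bit signed PCM.

```python
# Sanity-check a WAV file against the format the pretrained
# DeepSpeech 0.5.1 model expects: 16 kHz, mono, 16-bit PCM.
import wave

def check_wav(path):
    """Return a list of problems; an empty list means the file looks right."""
    problems = []
    with wave.open(path, "rb") as w:
        if w.getframerate() != 16000:
            problems.append("sample rate is %d, expected 16000" % w.getframerate())
        if w.getnchannels() != 1:
            problems.append("%d channels, expected mono" % w.getnchannels())
        if w.getsampwidth() != 2:
            problems.append("%d-byte samples, expected 16-bit" % w.getsampwidth())
    return problems
```

Running `check_wav("joana.wav")` on the file above should return an empty list, since it is already 16 kHz mono 16-bit.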
```
$ deepspeech --model deepspeech-0.5.1-models/output_graph.pbmm --alphabet deepspeech-0.5.1-models/alphabet.txt --lm deepspeech-0.5.1-models/lm.binary --trie deepspeech-0.5.1-models/trie --audio joana.wav
```
Expected: "hi my name is joanna welcome to same space how to install the data in my server and download it later on"
Got: "hi my name is joanna welcome to same space how to install the date my serverino mood it later on"
"serverino" is not even a word. This is very good quality audio; the transcription should have been flawless.
joana.wav.zip (217.8 KB)
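To put a number on the mismatch, the standard metric is word error rate (WER): the word-level edit distance between reference and hypothesis, divided by the reference length. A small self-contained sketch (my own scoring code, not the evaluation code DeepSpeech ships with):

```python
# Word error rate via Levenshtein distance over word sequences.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

expected = ("hi my name is joanna welcome to same space how to install "
            "the data in my server and download it later on")
got = ("hi my name is joanna welcome to same space how to install "
       "the date my serverino mood it later on")
print(round(wer(expected, got), 3))
```

A handful of substitutions and deletions in a 22-word reference still gives a noticeable WER, which matches the subjective impression that the result is far from flawless.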
So you have a feminine TTS voice. I heard "sand space" instead of "same space", and with the voice they use, I have a hard time hearing "the server and download it later", to be honest.
Our training dataset still lacks a good amount of feminine voices, and you are relying on TTS, so it's not surprising the results are not perfect. I'm not really sure what can be quickly improved in your case …
Which datasets were used to train this pretrained release model? I will exclude them, add other examples, and fine-tune. The release page mentions train_files.
Fisher, LibriSpeech, and Switchboard training corpora.
Have you used the full datasets mentioned above, or only parts of them?
This is in the release notes.
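For the fine-tuning route mentioned above: DeepSpeech's training scripts consume CSV manifests with `wav_filename`, `wav_filesize` and `transcript` columns (see the training docs). A minimal sketch for building such a manifest from your own 16 kHz clips; the `write_manifest` helper and the paths are hypothetical, only the three-column format comes from the docs.

```python
# Build a DeepSpeech-style training manifest: one row per clip with
# wav_filename, wav_filesize and transcript columns.
import csv
import os

def write_manifest(clips, out_path):
    """clips: list of (wav_path, transcript) pairs."""
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav_path, transcript in clips:
            # Transcripts are lowercased to match the released alphabet.txt,
            # which contains only lowercase characters.
            writer.writerow([wav_path, os.path.getsize(wav_path),
                             transcript.lower()])
```

You would then pass the resulting CSV via `--train_files` when running the training script.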