Hi, I’m using the deepspeech Python package to run speech-to-text inference on a wav file. I see output like this:
Loading model from file models/output_graph.pb
Loaded model in 0.249s.
Loading language model from files models/lm.binary models/trie
Loaded language model in 1.428s.
Running inference.
yes
Inference took 9.582s for 5.000s audio file.
Does this mean that the model predicted that my wav file contained the word “yes”? Is there an estimated confidence/accuracy score on this prediction? Were any other files with prediction information created?
Is it also possible to get timestamps for the predicted text? For example, I’d like to know that the model heard someone saying the word “yes” starting at 1.000 sec and ending at 1.010 sec.
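To make the request concrete, here is a rough sketch of the kind of per-word result I’m hoping to get back. The `WordTiming` class, its field names, and `format_timing` are hypothetical names I made up for illustration; they are not part of the deepspeech package. The confidence field is left empty because I don’t know whether the model exposes one.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WordTiming:
    """Hypothetical record for one recognized word (not a deepspeech type)."""
    word: str
    start_sec: float
    end_sec: float
    # Left as None: I don't know if a confidence score is available.
    confidence: Optional[float] = None

    @property
    def duration_sec(self) -> float:
        # Length of the spoken word in seconds.
        return self.end_sec - self.start_sec


def format_timing(t: WordTiming) -> str:
    """Render one word with its start/end times and confidence, if any."""
    conf = "n/a" if t.confidence is None else f"{t.confidence:.3f}"
    return f"{t.word}\t{t.start_sec:.3f}s -> {t.end_sec:.3f}s\tconfidence={conf}"


# The example from my question: "yes" from 1.000 sec to 1.010 sec.
example = WordTiming(word="yes", start_sec=1.000, end_sec=1.010)
print(format_timing(example))
```

A list of records like this, one per recognized word, would cover both the timestamp and the confidence questions.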