Speech-to-text JSON result with time per word

noobski_21 · October 19, 2018, 7:47am

Hi,

I’m currently using the SpeechMatics.com API to transcribe audio files into text, in the following JSON format:

[
{name: "word1", time: 130, …},
{name: "word2", time: 132, …},
…
]

But considering the cost per minute, I want to use my own engine. I tested DeepSpeech, and I think that with some training I will arrive at a good result. The only problem is that the output is raw text, so it is impossible for me to know when each word was pronounced.

Any idea how to reproduce the SpeechMatics API result?

Thanks in advance, and sorry for my bad English.

lissyx · October 19, 2018, 7:47am

Why don’t you use the library or its bindings and build it yourself? Besides, we have no way to produce a “time” that tells you when the word was spoken. There’s already a GitHub issue filed about that.

lissyx · October 19, 2018, 8:41am

I think you can achieve something similar with:

VAD
our streaming API

As you can see, the libdeepspeech API just returns a string, but you can then process that and produce JSON yourself.
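
To make the "produce JSON" step concrete, here is a minimal sketch in Python. It assumes you have already cut the audio into speech segments with a VAD and transcribed each segment separately, so that you hold a list of (start_ms, end_ms, text) tuples. That tuple layout and the function name `words_with_times` are made up for illustration and are not part of DeepSpeech. Since the engine only returns a string, the per-word times are estimated by spreading the words over the segment in proportion to their length, so treat them as rough approximations rather than real alignments.

```python
import json


def words_with_times(segments):
    """Turn (start_ms, end_ms, text) segments into per-word entries.

    Assumption: each segment is one VAD-delimited chunk that was transcribed
    on its own. Word times are estimated from word lengths, not measured.
    """
    result = []
    for start_ms, end_ms, text in segments:
        words = text.split()
        if not words:
            continue  # nothing recognised in this segment
        total_chars = sum(len(w) for w in words)
        duration = end_ms - start_ms
        elapsed_chars = 0
        for word in words:
            # Estimated offset: share of the segment taken by the earlier words
            offset = duration * elapsed_chars // total_chars
            result.append({"name": word, "time": start_ms + offset})
            elapsed_chars += len(word)
    return result


# Hypothetical input: boundaries would come from the VAD, text from the engine
segments = [(130, 930, "hello world"), (1500, 2100, "bye")]
print(json.dumps(words_with_times(segments)))
```

The shorter the VAD segments, the closer the estimated times get to the real ones; the first word of each segment always gets the segment start time.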

yv001 · October 19, 2018, 1:49pm

So, from the referenced GitHub issue, it looks like the new CTC, once integrated into the project, could provide character timestamps.

see this comment
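
If character timestamps do become available, turning them into the per-word format from the first post should be straightforward. The sketch below assumes the timestamps arrive as a list of (character, time_ms) pairs with spaces marking word boundaries; that input format and the name `words_from_chars` are hypothetical, since the actual output of the new CTC is not described in this thread. Each word takes the time of its first character.

```python
def words_from_chars(chars):
    """Group (character, time_ms) pairs into {"name", "time"} word entries."""
    words = []
    current = ""
    start = None
    for ch, t in chars:
        if ch == " ":
            # A space closes the word being built, if there is one
            if current:
                words.append({"name": current, "time": start})
            current, start = "", None
        else:
            if not current:
                start = t  # the first character sets the word's time
            current += ch
    if current:
        words.append({"name": current, "time": start})
    return words


# Hypothetical character timestamps, in milliseconds
chars = [("h", 130), ("i", 150), (" ", 170), ("y", 200), ("o", 220)]
print(words_from_chars(chars))
```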