General status of DeepSpeech

Hi, I’d love to have a URL where someone in the know routinely updates the rest of the world on how well DeepSpeech currently works, in practical terms. What can it do and what can it not do?

We would all love to see a free speech recognition engine!

Many greetings,
Stefan
BotCompany.de

This is already documented with each release. Can you elaborate on what is unclear, since you mention “practical terms”?

Well, that’s really two separate things:

  1. How well DeepSpeech functions as a recognition engine in general.

  2. How good a particular model is.

From what I’ve seen of DeepSpeech so far, I think it could be used in a production environment if the model were good enough. But you would have to train with your own data, because the pre-trained model wasn’t trained on enough data to be production-ready, IMO.

The WER is calculated from a clean test set, so you probably won’t get scores anywhere close to that on real-world data. But it will be interesting to see how the augmentations in 0.6 affect this.

Well, normally I use Chrome’s speech recognition and/or wit.ai.

Practical questions about DeepSpeech are:

- Are arbitrary-length input files supported? The first release only accepted a few seconds of input at a time, IIRC.
- Which languages are supported (English only)? I’m generally talking about pre-trained models.
- Is there an end-of-speech detection? Probably easy to add, but still it’s a question. (wit.ai doesn’t have that either.)
- How fast is recognition, generally? Both Google’s and wit.ai’s services are very fast.
- How much hardware do I need to run it efficiently?

The last release I saw release notes for is 0.5.1. There, I read:

“This release includes source code and a trained model trained on American English which achieves an 8.22% word error rate on the LibriSpeech clean test corpus.”

So I guess that answers the quality question, more or less. I’m not sure how good or bad 8.22% is. Actually, it seems very good: I’ve read elsewhere that “in 2018, 25% is an average word error rate among the various speech recognition services on the market”.
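For reference, WER is just the word-level edit distance (substitutions + deletions + insertions) between the recognized text and the reference transcript, divided by the number of reference words. A minimal sketch in plain Python (my own illustration, not code from the DeepSpeech repo):

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("the cat sat on the mat", "the cat sit on mat"))  # 2 errors / 6 words ≈ 0.33
```

So 8.22% means roughly one word in twelve is wrong on that (clean, read-speech) test set.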

Thanks

I’m not sure what you are referring to. The first versions had a limitation requiring the whole audio to be known at inference time, and performance was suboptimal on long audio files. This was fixed a long time ago.
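For illustration, a minimal batch-inference sketch against the 0.6-era Python bindings; the file paths are placeholders, and the exact `Model` constructor and `stt` signatures have shifted between releases, so treat the details as assumptions:

```python
import wave
import numpy as np
import deepspeech

# Paths are placeholders; 500 was the documented default beam width around 0.6.
model = deepspeech.Model("deepspeech-0.6.0-models/output_graph.pbmm", 500)

with wave.open("recording.wav", "rb") as f:  # expected: 16 kHz, 16-bit, mono
    audio = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)

print(model.stt(audio))  # transcribes the whole file in one call
```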

As you can see on each release’s page, yes, we only have English for now. Sourcing material is itself a long job, so we can’t do everything.

I’m not sure I understand your question …

That depends on your hardware …

That depends on your system. Any desktop CPU should provide better-than-real-time performance. We have verified faster-than-real-time operation on Android devices with Snapdragon 820 and 835 SoCs, as well as on Raspbian running on an RPi4.
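One way to check this on your own hardware is to measure the real-time factor, i.e. processing time divided by audio duration. A sketch under the same 0.6-era API assumptions as above (paths are placeholders):

```python
import time
import wave
import numpy as np
import deepspeech

model = deepspeech.Model("deepspeech-0.6.0-models/output_graph.pbmm", 500)  # placeholder
with wave.open("recording.wav", "rb") as f:  # 16 kHz, 16-bit, mono
    audio = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)

audio_seconds = len(audio) / 16000.0
start = time.time()
model.stt(audio)
rtf = (time.time() - start) / audio_seconds  # below 1.0 means faster than real time
print(f"real-time factor: {rtf:.2f}")
```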

You are comparing unrelated things: benchmarks and real-life use. We still have a long way to go, especially regarding the amount of data. More data will make the model more reliable (noise, accents, etc.).

We don’t provide an online service.

I know you don’t provide an online service. I want to get away from those online services.

It means that the speech recognizer detects the end of an utterance in a continuous audio stream.

I’m not sure we share the definition of “end-to-end”, but we don’t do that. There’s support for streaming, allowing you to feed audio continuously. It’s still up to you to take care of VAD and handle the start/stop of the stream.
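To illustrate that split of responsibilities, here is a sketch where webrtcvad (just one VAD option, not something DeepSpeech ships) decides when an utterance ends and the streaming API does the transcription. It assumes the 0.6-era Python API, where stream calls go through the model object; the paths and silence threshold are placeholders:

```python
import wave
import numpy as np
import deepspeech
import webrtcvad

SAMPLE_RATE = 16000          # the released English models expect 16 kHz mono
FRAME_MS = 30                # webrtcvad accepts only 10, 20 or 30 ms frames
FRAME_SAMPLES = SAMPLE_RATE * FRAME_MS // 1000

model = deepspeech.Model("deepspeech-0.6.0-models/output_graph.pbmm", 500)  # placeholder
vad = webrtcvad.Vad(3)       # aggressiveness 0 (lenient) to 3 (strict)

sctx = model.createStream()  # streaming context; feed audio as it arrives
silent_frames = 0

with wave.open("recording.wav", "rb") as f:   # stand-in for a live microphone feed
    while True:
        frame = f.readframes(FRAME_SAMPLES)   # bytes of 16-bit PCM
        if len(frame) < FRAME_SAMPLES * 2:
            break
        model.feedAudioContent(sctx, np.frombuffer(frame, dtype=np.int16))
        silent_frames = 0 if vad.is_speech(frame, SAMPLE_RATE) else silent_frames + 1
        if silent_frames > 20:                # ~600 ms of silence: end of utterance
            print(model.finishStream(sctx))
            sctx = model.createStream()       # fresh stream for the next utterance
            silent_frames = 0
```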

Never mind, my mind read that as end-to-end.

Perfect. It’s just that, since this was your point of comparison, I wanted to make sure you are aware that we don’t provide such hosting. Technically, it’s obviously doable, so you could host your own if you like or need to.