DeepSpeech for a production server

I want to build a speech recognition server. Expected concurrent requests: approximately 10,000-100,000.
And I have two questions about it.

  1. Is it a good idea to use DeepSpeech for this case? Does anyone have experience with something like this?
  2. Approximately what amount of resources (RAM, CPU/GPU) do I need for this purpose?

Hello! It depends on many things. On my machine, which has an RTX 2080 Ti and M.2 storage, 5 seconds of audio takes less than 1 second to convert to text. This speed depends on the size of the language model and its parameters, and of course the size of the acoustic model. If you are comparing a domain-specific model vs. a general model, the general model should cover a lot more words (a domain-specific model might be built on 100-500 hours of audio vs. several thousand hours for a general model …)
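If you want a rough number for your own hardware, here is a minimal timing sketch (assuming the `deepspeech` Python package with the v0.7+ API, where `Model` takes just the model path; file names are placeholders):

```python
import time
import wave

import numpy as np
from deepspeech import Model  # pip install deepspeech (CPU) or deepspeech-gpu

MODEL_PATH = "model.pbmm"  # placeholder: your exported acoustic model
AUDIO_PATH = "audio.wav"   # placeholder: a 16 kHz mono 16-bit WAV

model = Model(MODEL_PATH)

with wave.open(AUDIO_PATH, "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    duration = w.getnframes() / w.getframerate()

start = time.perf_counter()
text = model.stt(audio)
elapsed = time.perf_counter() - start

# A real-time factor below 1.0 means inference runs faster than the audio plays.
print(f"{text!r} in {elapsed:.2f}s for {duration:.2f}s of audio "
      f"(RTF {elapsed / duration:.2f})")
```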

You should also have a load balancer that routes traffic to the machines doing inference. One core can do one inference at a time. My machine has 6 cores and 16 GB of RAM, so I would say it can handle 4-6 simultaneous inferences …
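As a rough illustration of that one-core-one-inference rule, here is a minimal sketch of a worker pool that loads one model per process and caps concurrency at the core count (the model path is a placeholder, and the HTTP/load-balancing layer is left out):

```python
import multiprocessing as mp
import os

import numpy as np
from deepspeech import Model

MODEL_PATH = "model.pbmm"  # placeholder: your exported acoustic model
_model = None              # one Model instance per worker process


def _init_worker():
    global _model
    _model = Model(MODEL_PATH)


def _transcribe(audio):
    return _model.stt(audio)


# One worker per core, matching the one-core-one-inference rule of thumb.
pool = mp.Pool(processes=os.cpu_count(), initializer=_init_worker)


def transcribe_async(audio: np.ndarray):
    """Submit a 16 kHz int16 buffer; the request handler can wait on the result."""
    return pool.apply_async(_transcribe, (audio,))
```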

Hope this helps.

Thank you for the quick response.

I see that it depends on many things.

And I have another question in this case. For example, I have 10,000 different vocabularies, each with 10-100 words. Does it make sense to create a specific model for each vocabulary? Some vocabularies might be used pretty often, but others not so often.
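To make the idea concrete: rather than a full acoustic model per vocabulary, I am picturing one shared acoustic model with a small per-topic external scorer swapped in per request. A hypothetical sketch, assuming the v0.7+ `enableExternalScorer` API and made-up scorer paths:

```python
from deepspeech import Model

model = Model("model.pbmm")  # one shared acoustic model (placeholder path)

# Hypothetical per-topic scorer packages, one small language model per vocabulary.
SCORERS = {
    "animals": "scorers/animals.scorer",
    "colors": "scorers/colors.scorer",
}


def transcribe_for_topic(audio, topic):
    # Swapping the external scorer biases decoding toward that topic's words.
    model.enableExternalScorer(SCORERS[topic])
    try:
        return model.stt(audio)
    finally:
        model.disableExternalScorer()
```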

Could you tell me what kind of service you are trying to build? Who will be using it, and how?

Sure.
This is service for local schools.
We trying to improve kids speech who has hard accent or speech impediment.

It’s some kind of interactive game if kid says word correct he will see green background if wrong background will be red. Pretty simple.

For now we have 100 topics with pretty much unique words (and phrases based on them) that the kids need to practice; the core check is sketched below.
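A minimal sketch of that green/red check (the `transcribe` helper is a stand-in for whatever wrapper actually calls `model.stt()`):

```python
def check_pronunciation(transcript, target_word):
    """Return the background colour for one round of the game."""
    return "green" if transcript.strip().lower() == target_word.lower() else "red"


# Hypothetical usage, where transcribe(audio) wraps the DeepSpeech call:
# background = check_pronunciation(transcribe(audio), "elephant")
```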

Ok, I would start by making one big model and see how it performs. During that training you will face many problems, I can tell you, but one rule of thumb is just to pour in more data and play around with the learning rate and the size of the hidden layers. DeepSpeech can handle it.
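For reference, training is usually kicked off with the `DeepSpeech.py` script from a checkout of the training repo; a hedged sketch (CSV paths and flag values are illustrative starting points, and flag names have drifted slightly between releases):

```python
import subprocess

# Illustrative invocation of the DeepSpeech training script, run from the repo root.
subprocess.run([
    "python", "DeepSpeech.py",
    "--train_files", "data/train.csv",  # assumed CSVs: wav_filename,wav_filesize,transcript
    "--dev_files", "data/dev.csv",
    "--test_files", "data/test.csv",
    "--n_hidden", "2048",               # width of the hidden layers
    "--learning_rate", "0.0001",
    "--epochs", "30",
    "--export_dir", "export/",          # where the inference model ends up
], check=True)
```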

Actually, my opinion is that DeepSpeech is currently the best choice. Of course there are Kaldi, Sphinx, etc., but they are more complicated to set up.

We actually tried Kaldi, but it has poor performance with concurrent requests.
Right now we are on DeepSpeech and wav2letter; the latter is complicated to set up for now.

And one more question: we want to use DeepSpeech 5 because we need the metadata (confidence score). Is there any tutorial on how to train a model for this specific version?

Sorry, I don't know. Let me know how your wav2letter is doing. I checked it out and it looked interesting; I hadn't heard of it before.

Hi,

Your approach sounds very interesting.
Did you already test the model's feedback for your use case?

I would be afraid that with a limited vocabulary it is quite easy for the model to guess what your users were trying to say.
Do you have some way to train whether the pronunciation was good or not?
If not, which measure would you use to rate the pronunciation?

We have just played around with it, but one thing I can tell you for sure is that it performs really well.
Here is a comparison

We plan to use different types of audio with both good and bad accent/pronunciation.

But for now we want to use the new DeepSpeech metadata and its confidence score.
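This is roughly the call we mean; a minimal sketch assuming the Python API's `sttWithMetadata` (the return shape shown matches v0.9, while older releases exposed a flat `Metadata` with `.items` and `.confidence` instead):

```python
from deepspeech import Model

model = Model("model.pbmm")  # placeholder path


def transcribe_with_confidence(audio):
    meta = model.sttWithMetadata(audio, 1)  # ask for the single best transcript
    best = meta.transcripts[0]
    text = "".join(token.text for token in best.tokens)
    return text, best.confidence


# e.g. only accept the word above some tuned threshold:
# text, conf = transcribe_with_confidence(audio)
# background = "green" if text == target_word and conf > THRESHOLD else "red"
```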

That sounds like a good approach, thanks for the info.
Will you create that dataset yourself, or do you know of any open datasets?

Ok, sounds nice! The only thing keeping me from trying it is the preprocessing of the training data. How did you manage to preprocess it fast enough to move between DeepSpeech and wav2letter? The training data formats are quite different …
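For context, the conversion I have in mind is roughly the following sketch, from a DeepSpeech-style CSV (`wav_filename,wav_filesize,transcript`) to a wav2letter-style list file. The wav2letter line layout (`id path duration transcript`) is my reading of its docs, so double-check it against your version:

```python
import csv
import wave

# Rough DeepSpeech-CSV -> wav2letter-list converter (paths are placeholders).
with open("train.csv", newline="") as src, open("train.lst", "w") as dst:
    for i, row in enumerate(csv.DictReader(src)):
        with wave.open(row["wav_filename"], "rb") as w:
            duration_ms = 1000.0 * w.getnframes() / w.getframerate()
        dst.write(f"sample{i} {row['wav_filename']} "
                  f"{duration_ms:.1f} {row['transcript']}\n")
```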

Hi, has anyone used DeepSpeech in production on AWS?

I'm planning to use c5 instances, which feature the Intel Xeon Platinum 8000 series processors, but they do not have any GPUs. Would it be advisable to use such instances for DeepSpeech in production?
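My rough back-of-the-envelope sizing, in case it helps frame the question (every number below is a placeholder you would measure on an actual c5, e.g. with a timing snippet like the one earlier in the thread):

```python
# Back-of-the-envelope capacity estimate for CPU-only inference.
rtf = 1.5                    # measured real-time factor per core on the target CPU
concurrent_streams = 10_000  # lower end of the expected load
vcpus_per_instance = 16      # e.g. a c5.4xlarge

# One continuous stream needs ~rtf cores; one core runs one inference at a time.
cores_needed = concurrent_streams * rtf
instances = -(-cores_needed // vcpus_per_instance)  # ceiling division
print(f"~{cores_needed:.0f} cores -> ~{instances:.0f} instances")
```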