New project: deepspeech websocket server & client

DeepSpeech WebSocket Server

This is a WebSocket server (& client) for Mozilla’s DeepSpeech, enabling easy real-time speech recognition, with the client and server able to run in separate environments, either locally or remotely.

Work in progress. Developed to quickly test new models by running DeepSpeech in Windows Subsystem for Linux with microphone input from the host Windows machine. Shared here in case it saves others some time.

Features

  • Server
    • Streams raw audio data from client via WebSocket (a rough sketch follows this list)
    • Streaming inference via DeepSpeech v0.2+
    • Single-user (issues with concurrent streams)
  • Client
    • Streams raw audio data from microphone to server via WebSocket
    • Voice activity detection (VAD) to ignore noise and segment microphone input into separate utterances
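
For a rough idea of the server’s shape, here is a sketch (not the actual server.py): it assumes bottle plus the bottle-websocket plugin, which the startup log’s GeventWebSocketServer suggests, uses a placeholder transcribe() instead of the real DeepSpeech streaming calls, and assumes a text frame marks the end of an utterance.

# Sketch only: a Bottle app receiving raw audio frames over a WebSocket.
# Assumptions: bottle-websocket plugin, 16 kHz 16-bit mono PCM in binary
# frames, and a text frame as the (assumed) end-of-utterance marker.
from bottle import get, run
from bottle_websocket import GeventWebSocketServer, websocket

def transcribe(audio_bytes):
    # Placeholder: feed audio_bytes to a DeepSpeech stream here.
    return "<transcript>"

@get('/recognize', apply=[websocket])
def recognize(ws):
    audio = bytearray()
    while True:
        data = ws.receive()
        if data is None:                           # client disconnected
            break
        if isinstance(data, (bytes, bytearray)):   # binary frame: raw PCM
            audio.extend(data)
        else:                                      # text frame: end of utterance (assumed)
            ws.send(transcribe(bytes(audio)))
            audio = bytearray()

run(host='127.0.0.1', port=8080, server=GeventWebSocketServer)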

Nice, just be ready, 0.3.0 is coming :slight_smile:

Awesome :slight_smile:
Might be a great way to explore possibilities / use DeepSpeech for research projects on mobile devices until there are stable enough ports for Android and iOS.

Looks like a really good idea.

Have you had any success with the client on Linux at all?

I have run into various audio issues with PyAudio, on both my Arch Linux laptop and also on a Raspberry Pi (which has a Matrix Voice hat for the microphone). I can post more detail later (it’s late now!) but thought I’d check whether either sort of environment has worked for you?

You mention you were running on Windows host, so maybe it’s less fiddly there than I’m finding audio on Linux :slightly_smiling_face:

Thanks!

I admit my usage is for the client running on Windows, where pyaudio installed from binary wheels couldn’t be easier.

I haven’t used pyaudio on your 2 platforms, but it worked fine for me on Ubuntu 18.04 recently, once I installed the portaudio19-dev headers and added my user account to the audio group.
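
Unrelated to the project itself, but a quick generic check like the snippet below can confirm PortAudio actually sees an input device after that setup:

# List input-capable audio devices as seen by PyAudio/PortAudio.
import pyaudio

p = pyaudio.PyAudio()
for i in range(p.get_device_count()):
    info = p.get_device_info_by_index(i)
    if info.get('maxInputChannels', 0) > 0:
        print(i, info['name'], int(info['defaultSampleRate']))
p.terminate()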

Thanks @daanzu. The microphone wasn’t set up right in PulseAudio, and once I got that right (plus figured out a small issue with my laptop’s firewall!) I managed to get it working between two computers, both running Arch Linux.

It looks like it’ll be very useful - thanks again for putting this great project out there :slight_smile:


@daanzu: I am trying to set up the server on Ubuntu, but when I ran the command I got the error below. Please advise.

/deepspeech-websocket-server$ python server.py --model …/models/daanzu-6h-512l-0001lr-425dr/ -l -t
Traceback (most recent call last):
File "server.py", line 4, in
from bottle import get, run, template
ImportError: No module named bottle

The requirement is already installed, but I am getting the same error.

/deepspeech-websocket-server$ pip install bottle
Requirement already satisfied: bottle in /usr/local/lib/python3.5/dist-packages
You are using pip version 9.0.1, however version 18.1 is available.
You should consider upgrading via the 'pip install --upgrade pip' command.

@daanzu: Also, I am facing installation issues with the client on Windows. I tried googling it but without much success.

I get the below error when running pip install -r requirements-client.txt:

src/_portaudiomodule.c(29): fatal error C1083: Cannot open include file: 'portaudio.h': No such file or directory
error: command 'C:\Program Files (x86)\Microsoft Visual Studio\2017\BuildTools\VC\Tools\MSVC\14.16.27023\bin\HostX86\x64\cl.exe' failed with exit status 2

These appear to be general Python installation/configuration issues. On Ubuntu, python isn’t seeing the installed package; and on Windows, pip should be getting the binary wheel for pyaudio and not need to compile. Do other python scripts work? I’d suggest pursuing general python support resources.
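
For the Ubuntu error, one common cause is that python and pip point at different interpreters (the pip output above shows bottle under Python 3.5’s dist-packages). A quick, generic check is to run something like the snippet below with the same python command you use for server.py, and to install with python -m pip install bottle if the import fails:

# Generic check: which interpreter is running, and can it import bottle?
import sys
print(sys.executable)    # the interpreter this `python` command actually runs
import bottle
print(bottle.__file__)   # where the bottle package was found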

@daanzu: Thanks for the reply. I am able to run the setup with the pre-trained model available at https://github.com/mozilla/DeepSpeech/releases/download/v0.3.0/deepspeech-0.3.0-models.tar.gz

Do I need to use model/daanzu-6h-512l-0001lr-425dr for the setup?

My client is running on Windows and the server on Ubuntu. Can you please point out where I need to make changes so that the client can send the input (.wav) to the server and the server can send the text back to the client (i.e. the IPs for client and server)?

I tried the command below, but I don’t see any audio getting saved in the C directory. Please refer to the screenshot.

The model/daanzu-6h-512l-0001lr-425dr directory is just my own model for testing. Just pass any model directory, such as the pre-trained one, and use the default filenames; or pass each parameter/filename individually.

Currently, the client just listens to the microphone for audio. It would be easy to modify it to read WAV files, though: just add WAV-file reading code to the consumer function. The protocol is dead simple, so it’d be easy to just write a new client, too.
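
For example, a hypothetical standalone sender could look roughly like the sketch below (not part of the repo; the chunk size and the "EOF" end-of-utterance marker are assumptions, so check client.py and server.py for the actual framing):

# Hypothetical example: stream a 16 kHz 16-bit mono WAV file to the server
# and print the transcript it sends back. Uses the websocket-client package.
import wave
from websocket import create_connection   # pip install websocket-client

def send_wav(path, url='ws://localhost:8080/recognize'):
    ws = create_connection(url)
    with wave.open(path, 'rb') as wf:
        assert wf.getframerate() == 16000 and wf.getnchannels() == 1
        chunk = wf.readframes(4000)        # ~250 ms of audio per message
        while chunk:
            ws.send_binary(chunk)
            chunk = wf.readframes(4000)
    ws.send('EOF')                         # assumed end-of-utterance marker
    print(ws.recv())                       # transcript from the server
    ws.close()

send_wav('test.wav')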

Your command looks good, but the absolute Windows path might be getting parsed wrong. Try "." for the current directory, or any relative path. It should also show a spinner when it hears audio on the microphone.

@daanzu: I also want real-time streaming of audio, so I want the client to listen to the microphone. I made the changes, but somehow I still don’t see any spinner coming up when I speak. It’s dead. Any help?

What I understand so far is that the client will listen to the microphone, create the WAV file, and route it to the server for text transformation. So I believe I don’t have to make any changes to the client code.

Hi,
I get the below error while running the client file:

Connecting to 'ws://localhost:8080/recognize'...
ALSA lib pcm_dmix.c:1052:(snd_pcm_dmix_open) unable to open slave
ALSA lib pcm.c:2495:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.rear
ALSA lib pcm.c:2495:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.center_lfe
ALSA lib pcm.c:2495:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.side
ALSA lib pcm.c:2495:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.surround71
ALSA lib setup.c:547:(add_elem) Cannot obtain info for CTL elem (MIXER,'IEC958 Playback Default',0,0,0): No such file or directory
ALSA lib pcm.c:2495:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.hdmi
ALSA lib pcm.c:2495:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.hdmi
ALSA lib pcm.c:2495:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.modem
ALSA lib pcm.c:2495:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.modem
ALSA lib pcm.c:2495:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.phoneline
ALSA lib pcm.c:2495:(snd_pcm_open_noupdate) Unknown PCM cards.pcm.phoneline
ALSA lib pcm_dmix.c:1052:(snd_pcm_dmix_open) unable to open slave

Is this something related to my configuration? Any help, please.

Thanks!

It uses pyaudio, which uses portaudio. Try searching for that, because the problem appears to be a general audio issue.
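
A generic way to check whether it’s a system audio problem rather than anything in the client: if a bare pyaudio open of the default input device, like the snippet below, fails with the same ALSA errors, then the websocket client isn’t the issue.

# Generic PortAudio sanity check: open the default input device and read a bit.
import pyaudio

p = pyaudio.PyAudio()
stream = p.open(format=pyaudio.paInt16, channels=1, rate=16000,
                input=True, frames_per_buffer=1024)
data = stream.read(1024)
print('read', len(data), 'bytes from the default input device')
stream.stop_stream()
stream.close()
p.terminate()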

Thanks @daanzu, it seems to be my system’s microphone issue. Once I get that resolved, I will post an update on my progress.

Thanks!

Hi,
I am using this to deploy DeepSpeech on a server, but after the server starts, nothing is showing in the browser when I visit that port.

(deepspeech-train-venv) yk@andromeda:~/deepspeech-websocket-server$ python server.py --model /home/yk/nestle_project/DeepSpeech/data/model/
Initializing model...
2019-10-22 AM 11:37:55.708: __main__: INFO: <module>(): ARGS.model: /home/yk/nestle_project/DeepSpeech/data/model/output_graph.pb
2019-10-22 AM 11:37:55.708: __main__: INFO: <module>(): ARGS.alphabet: /home/yk/nestle_project/DeepSpeech/data/model/alphabet.txt
TensorFlow: v1.13.1-10-g3e0cc53
DeepSpeech: v0.5.1-0-g4b29b78
Warning: reading entire model file into memory. Transform model file into an mmapped graph to reduce heap usage.
2019-10-22 11:37:55.709031: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2019-10-22 11:37:55.808088: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "CPU"') for unknown op: UnwrapDatasetVariant
2019-10-22 11:37:55.808117: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: WrapDatasetVariant
2019-10-22 11:37:55.808123: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "WrapDatasetVariant" device_type: "CPU"') for unknown op: WrapDatasetVariant
2019-10-22 11:37:55.808214: E tensorflow/core/framework/op_kernel.cc:1325] OpKernel ('op: "UnwrapDatasetVariant" device_type: "GPU" host_memory_arg: "input_handle" host_memory_arg: "output_handle"') for unknown op: UnwrapDatasetVariant
Bottle v0.12.17 server starting up (using GeventWebSocketServer())...
Listening on http://127.0.0.1:8080/
Hit Ctrl-C to quit.

Is it possible to use this with a model trained on a language other than English? I am training a network on another language and want to use it.

Yes, that should work (although I haven’t tried this).

So long as you can get your model working on the server side, then I’m pretty sure it’ll be fine. That would be very similar to running regular inference on your model.

I don’t think the VAD functionality in the client would be affected by use with other languages (from a quick Google I couldn’t find anything suggesting it’s English-only, and it seems like common sense that detecting voiced vs. non-voiced audio wouldn’t be strongly influenced by the language being spoken).

Therefore I’d suggest giving it a go. It would be great if you could report back here on your progress so others know it works.
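
To illustrate the VAD point: a frame-level VAD like webrtcvad (a likely candidate for what the client uses, though I haven’t checked its requirements) just labels short chunks of PCM as voiced or not, with no language model involved:

# Frame-level VAD: classify 30 ms chunks of 16 kHz 16-bit mono PCM as speech
# or non-speech. Nothing here depends on the language being spoken.
import webrtcvad

vad = webrtcvad.Vad(2)                      # aggressiveness 0-3
sample_rate = 16000
frame_bytes = int(sample_rate * 0.03) * 2   # 30 ms of 16-bit samples

silence = b'\x00' * frame_bytes
print(vad.is_speech(silence, sample_rate))  # expect False for pure silence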

I haven’t tried it, but I think it should work. Feel free to let me know.