Streaming API on macOS

Trying to replicate Ruben’s small Python program (https://hacks.mozilla.org/2018/09/speech-recognition-deepspeech/) in the macOS Mojave terminal.

libsox is installed and I have access to the microphone from the terminal. I’m working inside a Python 3.7.0 virtual environment.

The program runs without errors, but the output is just “Transcription:” with a blank result. It appears model.finishStream(sctx) returns an empty string.

I have verified the mic is working by changing the rec flags from -q to -S and -V3, and the paths to the model, LM and trie files are all correct (DeepSpeech works when called from the command line and supplied an audio file argument).

Lastly this is the console output:

> python test.py --model models/output_graph.pbmm --alphabet models/alphabet.text --lm models/lm.binary --trie models/trie
Initializing model...
TensorFlow: v1.12.0-10-ge232881c5a
DeepSpeech: v0.4.1-0-g0e40db6
2019-05-03 15:14:11.615995: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
You can start speaking now. Press Control-C to stop recording.
rec:      SoX v
rec WARN formats: can't set sample rate 16000; using 44100
rec WARN formats: can't set 1 channels; using 2

Input File     : 'default' (coreaudio)
Channels       : 2
Sample Rate    : 44100
Precision      : 32-bit
Sample Encoding: 32-bit Signed Integer PCM
Endian Type    : little
Reverse Nibbles: no
Reverse Bits   : no

Output File    : '-' (raw)
Channels       : 1
Sample Rate    : 16000
Precision      : 16-bit
Sample Encoding: 16-bit Signed Integer PCM
Endian Type    : little
Reverse Nibbles: no
Reverse Bits   : no
Comment        : 'Processed by SoX'

rec INFO sox: effects chain: input        44100Hz  2 channels
rec INFO sox: effects chain: gain         44100Hz  2 channels
rec INFO sox: effects chain: channels     44100Hz  1 channels
rec INFO sox: effects chain: rate         16000Hz  1 channels
rec INFO sox: effects chain: dither       16000Hz  1 channels
rec INFO sox: effects chain: output       16000Hz  1 channels
In:0.00% 00:00:08.61 [00:00:00.00] Out:137k  [      |      ]        Clip:0    ^C
Aborted.
Transcription: 

Thank you.

Please dump the WAV PCM that you feed to the streaming API and try it with the prebuilt tools. We won’t be able to check whether there’s a bug in your code if you don’t share the code.

That doesn’t look like a bug in the client. The SoX warnings mean it can’t set the recording sample rate and channel count directly on the hardware/driver, but in that case it does the conversion in software, so you still get the proper output in the end, as the output table shows.
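As a quick sanity check of that software conversion, the numbers in the log above are self-consistent: rec reports roughly 8.61 seconds recorded and “Out:137k” samples written, which is exactly what 16 kHz mono output implies.

```python
# Sanity check using the figures from the rec log above:
# 00:00:08.61 elapsed, output resampled to 16 kHz mono, "Out:137k" samples.
duration_s = 8.61          # elapsed time shown in the rec progress line
out_rate_hz = 16000        # output sample rate from the SoX table
expected_samples = duration_s * out_rate_hz
print(round(expected_samples))  # ~137760, i.e. the "137k" rec reports
```

So the pipeline really is delivering 16 kHz mono audio downstream; the problem is unlikely to be the conversion itself.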

If you record into a WAV file instead of using the streaming API and then feed that WAV file to our deepspeech client, does it work?

Yes, it works with a regular WAV file; the problem only appears when streaming. And I used the prebuilt tools.

We would really need to see your code to help you. Could you also share the recording?
Have you dumped the bits you send to the streaming API as a WAV file and checked that it works as well? Your wording above is unclear about that.

I don’t have any custom code; I am using the prebuilt DeepSpeech tools for everything. The code I am using is exactly the one Ruben discussed last year. Here it is:

import argparse
import deepspeech as ds
import numpy as np
import shlex
import subprocess
import sys

parser = argparse.ArgumentParser(description='DeepSpeech speech-to-text from microphone')
parser.add_argument('--model', required=True,
                    help='Path to the model (protocol buffer binary file)')
parser.add_argument('--alphabet', required=True,
                    help='Path to the configuration file specifying the alphabet used by the network')
parser.add_argument('--lm', nargs='?',
                    help='Path to the language model binary file')
parser.add_argument('--trie', nargs='?',
                    help='Path to the language model trie file created with native_client/generate_trie')
args = parser.parse_args()

LM_WEIGHT = 1.50
VALID_WORD_COUNT_WEIGHT = 2.25
N_FEATURES = 26
N_CONTEXT = 9
BEAM_WIDTH = 512

print('Initializing model...')

model = ds.Model(args.model, N_FEATURES, N_CONTEXT, args.alphabet, BEAM_WIDTH)
if args.lm and args.trie:
    model.enableDecoderWithLM(args.alphabet,
                              args.lm,
                              args.trie,
                              LM_WEIGHT,
                              VALID_WORD_COUNT_WEIGHT)
sctx = model.setupStream()

subproc = subprocess.Popen(shlex.split('rec -q -V0 -e signed -L -c 1 -b 16 -r 16k -t raw - gain -2'),
                           stdout=subprocess.PIPE,
                           bufsize=0)
print('You can start speaking now. Press Control-C to stop recording.')

try:
    while True:
        data = subproc.stdout.read(512)
        model.feedAudioContent(sctx, np.frombuffer(data, np.int16))
except KeyboardInterrupt:
    print('Transcription:', model.finishStream(sctx))
    subproc.terminate()
    subproc.wait()
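One detail worth noting about the read loop above: each 512-byte chunk read from rec holds exactly 256 little-endian 16-bit samples, which np.frombuffer reinterprets without copying. A standalone illustration (no microphone needed; the zero-filled chunk is a stand-in for subproc.stdout.read(512)):

```python
import numpy as np

# A 512-byte chunk of 16-bit signed little-endian PCM, as produced by
# `rec -e signed -L -b 16`, holds exactly 512 / 2 = 256 samples.
chunk = bytes(512)                       # stand-in for subproc.stdout.read(512)
samples = np.frombuffer(chunk, np.int16) # zero-copy reinterpretation
print(len(samples))                      # 256 samples per chunk
```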

I didn’t change anything. As for the WAV file, I am not that good with Linux, so I don’t know how to dump the WAV file and pipe it to DeepSpeech at the same time. I did a simple command-line rec into a WAV file, and using the --audio argument with that file worked.
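For anyone in the same spot: you don’t need extra tooling to dump the audio while streaming; the read loop in the script can write every chunk it feeds to the streaming API into a WAV file as well. A minimal sketch of that “tee” pattern using only the standard-library wave module (the io.BytesIO stream here is a stand-in for subproc.stdout; in the real script you would keep the rec subprocess and the feedAudioContent call as they are):

```python
import io
import wave

import numpy as np

# Stand-in for the rec subprocess: 1 second of 16-bit mono silence.
stream = io.BytesIO(bytes(2 * 16000))

wav = wave.open('dump.wav', 'wb')
wav.setnchannels(1)       # mono, matching `rec -c 1`
wav.setsampwidth(2)       # 16-bit samples, matching `-b 16`
wav.setframerate(16000)   # 16 kHz, matching `-r 16k`

while True:
    data = stream.read(512)
    if not data:
        break
    wav.writeframes(data)                    # tee a copy into the WAV file...
    samples = np.frombuffer(data, np.int16)  # ...and keep converting chunks
    # model.feedAudioContent(sctx, samples)  # ...for the streaming API as before
wav.close()

# Re-open to confirm the dump has the expected format and length.
check = wave.open('dump.wav', 'rb')
print(check.getnframes(), check.getframerate())  # 16000 frames at 16000 Hz
```

The resulting dump.wav can then be passed to the deepspeech client with --audio to hear exactly what the streaming API received.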

I just want to help anyone else who ever has this issue figure out whether it is the code or my platform that has the problem. As I said, I use Mac OS X Sierra.

This is weird. Just to be sure, did you double check that you’re passing the same parameters to both the microphone script and the client that takes a WAV file? E.g. are you passing the same LM/trie files to both, same model, etc. Dumb question, but just trying to make sure.

Also, could you try updating the LM_WEIGHT and VALID_WORD_COUNT_WEIGHT values in that script to 0.75 and 1.85, respectively? They were updated between the blog post and v0.4.1. Does that change the output?
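Applied to the constants near the top of the script, that suggested change is just:

```python
# Decoder hyperparameters updated for v0.4.1 (the blog post predates them).
LM_WEIGHT = 0.75                 # was 1.50
VALID_WORD_COUNT_WEIGHT = 1.85   # was 2.25
```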