Very poor performance on 8 kHz data

Hi everyone

I have finally succeeded in training my model on mixed Russian/Kazakh data.
However, I got a very high WER.
I suspect the main reason is that my audio is 8 kHz instead of 16 kHz.
What do you think about this? Any comments?

This is what I got at the end of the training session:

I Training of Epoch 99 - loss: 2.772892
I FINISHED Optimization - training time: 1:20:22
100% (499 of 499) |######################| Elapsed Time: 0:29:22 Time: 0:29:22
Preprocessing ['/home/dulan/data/test.csv']
Preprocessing done
Computing acoustic model predictions...
100% (285 of 285) |######################| Elapsed Time: 0:08:37 Time: 0:08:37
Decoding predictions...
100% (285 of 285) |######################| Elapsed Time: 0:07:52 Time: 0:07:52
Test - WER: 0.993192, CER: 66.797807, loss: 326.212860

Train: 15,000 wav files
Dev: 4,000 wav files
Test: 1,500 wav files

The run command:
nohup python -u DeepSpeech/DeepSpeech.py --train_files /home/dulan/data/train.csv --dev_files /home/dulan/data/dev.csv --test_files /home/dulan/data/test.csv --train_batch_size 8 --dev_batch_size 8 --test_batch_size 8 --alphabet_config_path /home/dulan/models/alphabet.txt --lm_binary_path /home/dulan/models/lm.binary --lm_trie_path /home/dulan/models/trie --epoch 100 --display_step 1 --export_dir /home/dulan/models --learning_rate 0.000025 --dropout_rate 0 --word_count_weight 3.5 --log_level 1

Also, how can we change DeepSpeech's default 16 kHz input to 8 kHz?

@lissyx (BTW, thanks for your earlier comments)

Best

  • How did you convert the 8 kHz data to 16 kHz?
  • Converting the entire Deep Speech pipeline to use 8 kHz is harder than converting your 8 kHz data to 16 kHz, so I'd suggest converting your 8 kHz data to 16 kHz (see the sketch below).
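
For example, a minimal batch-upsampling sketch with ffmpeg could look like the following; the wav_8k/ and wav_16k/ directory names are just placeholders, not paths from this thread:

# Hypothetical sketch: upsample every 8 kHz wav in wav_8k/ to 16 kHz, mono, 16-bit PCM
mkdir -p wav_16k
for f in wav_8k/*.wav; do
  ffmpeg -i "$f" -ar 16000 -ac 1 -acodec pcm_s16le "wav_16k/$(basename "$f")"
done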

It is not difficult to convert 8 kHz to 16 kHz, but upsampling will introduce some noise into the 16 kHz data.

How can DeepSpeech be reconfigured to work on 8 kHz data?

I know it's easy to convert 8 kHz data to 16 kHz, and I also know that there is a bug in the standard Python "audioop" module that does this conversion. That is why I asked how you did the conversion.

So I want to ask again: how did you convert the 8 kHz data to 16 kHz?

Using an ffmpeg bash command.

OK.
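
It's also worth double-checking that the converted files really are 16 kHz, mono, 16-bit PCM, e.g. with an ffprobe call along these lines (the filename here is just a placeholder):

# Hypothetical check: print codec, channel count, and sample rate of a converted file
ffprobe -v error -show_entries stream=codec_name,channels,sample_rate -of default=noprint_wrappers=1 output_16k.wav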

I just looked at the size of your training set, and it looks small.

To give you a better picture of the scale of data required: we train on about 1.5M wav files, each of which is about 5-8 seconds long. So I think 15K files, unless they are very long files, is not enough to train on.
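
To put rough numbers on that: 1.5M clips at 5-8 seconds each works out to well over 2,000 hours of audio, whereas 15K clips of similar length is only on the order of 20-30 hours.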

Out of curiosity, what did the final loss look like for your dev set? The test loss of 326.212860 seems very high, especially next to your final training loss of about 2.77, so I'm guessing you're overfitting the training set with 100 epochs and not generalizing well.