Creating DeepSpeech Model for Hindi

cryptoaimdy · October 14, 2019, 1:33pm

I created a model for Hindi.
After training on my data, I get an error at the test step:

I Restored variables from best validation checkpoint at 
   hindi_checkpoint/best_dev-90, step 90
 Testing model on data/test/test.csv
Test epoch | Steps: 0 | Elapsed Time: 0:00:00                                                                                                                                               
 Fatal Python error: Segmentation fault

What could be the problem?

lissyx · October 14, 2019, 1:37pm

Without more context, it’s going to be hard … How have you set things up?
I had a similar crash, resolved by re-creating the virtualenv from scratch …

cryptoaimdy · October 14, 2019, 1:50pm

Well, I recorded a few audio clips of my own and prepared the datasets.

Step 1: prepared the datasets, vocabulary and alphabet.txt; created the ARPA file and then lm.binary from the Hindi vocabulary using KenLM.
Step 2: using the native client Bazel build, created the trie file from lm.binary.

Step 3: cloned DeepSpeech and checked out 0.5.1 with git checkout.

Placed my trie and binary into data/lm/.

I am running DeepSpeech.py, but after training I get a segmentation fault at the testing step.

Is it because there is too little data in the test set?

lissyx · October 14, 2019, 2:35pm

Unlikely. You have not documented anything about how you set up the virtualenv … Did you read my reply?

cryptoaimdy · October 14, 2019, 3:14pm

Okay, I’ll try creating the venv from scratch again.

cryptoaimdy · October 15, 2019, 5:28am

Started from scratch and re-created the venv. Same error again.

lissyx · October 15, 2019, 10:56am

Well, sorry, but with so few details, I can’t even try to reproduce …

lissyx · October 15, 2019, 11:03am

@cryptoaimdy Seriously, I would like to help you, but you keep not sharing your complete STR (steps to reproduce). This is making both of us lose valuable time. So once again, share detailed and complete STR for everything you do to reproduce the issue. And try to reproduce with our default data (LDC93S1, English model, LM and alphabet, our native client build), to make sure the problem is not coming from there.

There are 10+ variables in play here; we can’t do divination from a single segfault.

cryptoaimdy · October 15, 2019, 11:16am

Solved it this morning. The model is ready and gives a loss of about 60 on average. Now I am creating more Hindi datasets.

lissyx · October 15, 2019, 11:57am

Would you mind sharing what the solution was?

cryptoaimdy · October 15, 2019, 12:03pm

As you suggested, I started from scratch and set up the venv again.

I think the problem was a version mismatch between the LM binary and trie on one side and DeepSpeech on the other: earlier I had created the LM binary and trie without using the virtualenv, while I was running DeepSpeech inside the venv.
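A quick way to spot this kind of mismatch early is to compare the version of the checkout with the version of the decoder package installed in the active venv. The following is only a sketch under assumptions that are not stated in this thread: that the checkout has a VERSION file at its root, and that the decoder package is named ds_ctcdecoder. The function names are hypothetical, and the check cannot inspect the trie or lm.binary files themselves.

```python
from importlib import metadata  # standard library, Python 3.8+
from pathlib import Path


def installed_version(package):
    """Return the installed version of `package` in the current env, or None."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def checkout_version(repo_dir):
    """Return the contents of <repo_dir>/VERSION, or None if it is missing.

    Assumption: the checkout keeps its version in a top-level VERSION file.
    """
    version_file = Path(repo_dir) / "VERSION"
    if not version_file.is_file():
        return None
    return version_file.read_text(encoding="utf-8").strip()


def versions_match(expected, found):
    """True only when both versions are known and identical."""
    return expected is not None and found is not None and expected == found


if __name__ == "__main__":
    # "DeepSpeech" (checkout directory) and "ds_ctcdecoder" (package name)
    # are assumptions; adjust them to the actual setup.
    repo = checkout_version("DeepSpeech")
    decoder = installed_version("ds_ctcdecoder")
    print("checkout:", repo, "| decoder package:", decoder)
    print("match:", versions_match(repo, decoder))
```

Running it inside the activated venv before training shows whether the two versions agree; a result of None for either value means that piece could not be found at all.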

lissyx · October 15, 2019, 12:06pm

Right, thanks, at least it confirms my first assumption. Glad to see it is working now.

cryptoaimdy · October 15, 2019, 12:09pm

Yes, a careful setup is all we need.

But for Hindi, the src " " part is not accurate. During testing, src contains sentences other than those in my original test.csv, and the sentences in src do not even make sense. They look like ‘abcsdksbfdfa’ (random Hindi letters).

cryptoaimdy · October 15, 2019, 12:09pm

WER: 0.000000, CER: 0.000000, loss: 7.111163
 - wav: file:///home/yk/hindi-deep/DeepSpeech/data/test/010.wav
 - src: "ापखी खखैुेी"
 - res: "ापखी खखैुेी"
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.000000, loss: 8.599924
 - wav: file:///home/yk/hindi-deep/DeepSpeech/data/test/008.wav
 - src: "पेांीखतहाखाख खैुेी"
 - res: "पेांीखतहाखाख खैुेी"
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.000000, loss: 11.154081
 - wav: file:///home/yk/hindi-deep/DeepSpeech/data/test/007.wav
 - src: "ुबखंेदयदी"
 - res: "ुबखंेदयदी"
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.000000, loss: 12.389311
 - wav: file:///home/yk/hindi-deep/DeepSpeech/data/test/001.wav
 - src: " तपुबखैीबखुखतखप"
 - res: " तपुबखैीबखुखतखप"
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.000000, loss: 12.756799
 - wav: file:///home/yk/hindi-deep/DeepSpeech/data/test/006.wav
 - src: "बडैखखुखतखैयीखखपर"
 - res: "बडैखखुखतखैयीखखपर"
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.000000, loss: 17.487480
 - wav: file:///home/yk/hindi-deep/DeepSpeech/data/test/003.wav
 - src: "ाुखतरखदेतदीधीखुखाहखाख खैुखुपखखपन"
 - res: "ाुखतरखदेतदीधीखुखाहखाख खैुखुपखखपन"
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.000000, loss: 18.130671
 - wav: file:///home/yk/hindi-deep/DeepSpeech/data/test/002.wav
 - src: "तरखबडैीखैीखखापखी खखैुखाै खखपन"
 - res: "तरखबडैीखैीखखापखी खखैुखाै खखपन"

cryptoaimdy · October 15, 2019, 12:11pm

This is what I am getting. Is it because of too little data? I think src should be displayed exactly as it is in test.csv.

lissyx · October 15, 2019, 12:13pm

I don’t understand, it looks like you have src == res, which would mean computed transcription matches expected transcription.
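For reference, the WER and CER figures in the report can be read as edit distances normalised by the length of the expected transcript (src). A minimal sketch, assuming the usual Levenshtein-based definitions; the function names are illustrative and are not DeepSpeech's own API:

```python
def levenshtein(ref, hyp):
    """Edit distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # substitution or match
        prev = cur
    return prev[-1]


def wer(src, res):
    """Word error rate: word-level edit distance / number of words in src."""
    ref_words = src.split()
    if not ref_words:
        raise ValueError("src must contain at least one word")
    return levenshtein(ref_words, res.split()) / len(ref_words)


def cer(src, res):
    """Character error rate: character-level edit distance / len(src)."""
    if not src:
        raise ValueError("src must not be empty")
    return levenshtein(src, res) / len(src)


if __name__ == "__main__":
    print(wer("a b", "a b"), cer("a b", "a b"))  # identical strings
    print(wer("a b", "c d"))                     # every word differs
```

With these definitions, identical src and res give WER 0 and CER 0, and a two-word src whose two words are both wrong gives WER 1. Both metrics only compare src with res; they say nothing about whether src itself matches the CSV.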

cryptoaimdy · October 15, 2019, 12:16pm

WER: 1.000000, CER: 0.600000, loss: 33.510384
 - wav: file:///home/yk/hindi-deep/DeepSpeech/data/test/009.wav
 - src: "ैयीखखाखदुखखौखप हखपरा"
 - res: "पखाख खपरा"
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.000000, loss: 6.128268
 - wav: file:///home/yk/hindi-deep/DeepSpeech/data/test/004.wav
 - src: "ुखत"
 - res: "ुखत"
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.000000, loss: 6.713559
 - wav: file:///home/yk/hindi-deep/DeepSpeech/data/test/005.wav
 - src: "ापखी ख"
 - res: "ापखी ख"
--------------------------------------------------------------------------------
WER: 0.000000, CER: 0.000000, loss: 7.111163
 - wav: file:///home/yk/hindi-deep/DeepSpeech/data/test/010.wav
 - src: "ापखी खखैुेी"
 - res: "ापखी खखैुेी"
-------------------------------------------------------------------------

For the first one I got a WER of 1.

lissyx · October 15, 2019, 12:17pm

Well, this is the test set showing worst examples. I don’t see anything strange, please elaborate.

cryptoaimdy · October 15, 2019, 12:19pm

wav_filename,wav_filesize,transcript
001.wav,101000,तमहरआ कयआ नाम ह
002.wav,138000,मै आपकी कया सहायता कर सकता हू
003.wav,125000,सर मै विमवीशयोर से बात कर रहा हू
004.wav,78000,नाम
005.wav,99400,सहायता
006.wav,106000,आपका नाम क्या है
007.wav,80900,नई दिल्ली
008.wav,90700,हिंदी में बात करिए
009.wav,81900,क्या बोलना चाहते हैं
010.wav,88700,सहायता करिए

This is my original test.csv content. Compare it with src: it is totally different.
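The src strings in the report have the same number of characters as the corresponding transcripts (for 004.wav, three in both), which is at least consistent with a problem in the character-to-label mapping rather than with the amount of data. One thing that may be worth checking, under that assumption, is whether alphabet.txt has exactly one character per line and covers every character used in the CSV. A minimal sketch with hypothetical function names; the comment convention (lines starting with # are comments, \# stands for a literal #) follows the header of the stock alphabet.txt:

```python
import csv
import io


def parse_alphabet(lines):
    """Return one label per non-comment line, without the trailing newline."""
    labels = []
    for line in lines:
        entry = line.rstrip("\n")  # do not strip spaces: " " is a valid label
        if entry.startswith("\\#"):
            entry = "#"            # escaped hash is a real label
        elif entry.startswith("#"):
            continue               # comment line
        labels.append(entry)
    return labels


def check_alphabet(labels, transcripts):
    """Report suspicious alphabet entries and characters missing from it."""
    # Entries that are not exactly one code point (empty lines, stray "\r",
    # several characters on one line).
    bad_entries = [(i, e) for i, e in enumerate(labels) if len(e) != 1]
    duplicates = sorted({e for e in labels if labels.count(e) > 1})
    known = set(labels)
    missing = sorted({ch for t in transcripts for ch in t if ch not in known})
    return bad_entries, duplicates, missing


def read_transcripts(csv_text):
    """Extract the transcript column from CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row["transcript"] for row in reader]


if __name__ == "__main__":
    # Inline toy data; in practice read alphabet.txt and test.csv as UTF-8.
    labels = parse_alphabet(["# comment\n", " \n", "a\n", "b\r\n"])
    rows = read_transcripts("wav_filename,wav_filesize,transcript\n1.wav,10,ab c\n")
    print(check_alphabet(labels, rows))
```

Any non-empty result from check_alphabet points at something to fix in alphabet.txt or in the transcripts before retraining. Whether this explains the output in this thread is not established here.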

cryptoaimdy · October 15, 2019, 12:20pm

This is the original transcript.

This is what I get in src.