Python error: Segmentation fault when training

We also moved from TensorFlow r1.13 to r1.14, so the issue could still be consistent with that change.

Could you isolate one sentence that reproduces the issue, and try to see whether we can track it down to a single character?

That’s what I’m trying to do.

I have a list of the original ‘bad files’ that became usable after conversion, so I will check whether the issue is caused by:
a) Bad conversion
b) Bad transcript

Also, could you try running with LC_ALL=C?

Sure, is this a DeepSpeech.py flag or a compilation flag?

Good to note regarding the TensorFlow r1.13 to r1.14 move. I will document a reproducible environment.

It’s an environment variable: LC_ALL=C python DeepSpeech.py [...]
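
If it’s easier to drive from Python, a minimal sketch of the same thing (the training arguments stay whatever you normally pass in place of the [...]):

    import os
    import subprocess

    # Run training under the C locale to rule out locale-dependent
    # string handling; append your usual training flags to the command.
    env = dict(os.environ, LC_ALL="C")
    subprocess.run(["python", "DeepSpeech.py"], env=env)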

I was able to train a model on 0.6.0 with our Arabic character encodings. It looks like there was an issue with the encoding of the transcripts in the CSV files.
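
For anyone hitting the same thing, this is roughly the kind of check that caught it for us; a minimal sketch (the helper name is ours; it assumes the usual DeepSpeech CSV columns and an alphabet.txt with one character per line, where ‘#’ lines are comments):

    import csv
    import io

    def find_bad_transcripts(csv_path, alphabet_path):
        # Report rows whose transcript contains characters outside the alphabet.
        with io.open(alphabet_path, encoding="utf-8") as f:
            alphabet = {line.rstrip("\n") for line in f if not line.startswith("#")}
        bad = []
        with io.open(csv_path, encoding="utf-8") as f:
            for row in csv.DictReader(f):
                unknown = set(row["transcript"]) - alphabet
                if unknown:
                    bad.append((row["wav_filename"], sorted(unknown)))
        return bad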

Thanks for all the help @lissyx!

Hi Anas,

Thanks for sharing your work on GitHub. I see that your .csv files separate the Arabic letters, and that you use the underscore character “_” to separate two words. However, the language model is trained on ordinary written Arabic text, where a space separates any two words and most letters are not separated. I know that DeepSpeech uses the language model to correct spelling and grammar mistakes in the transcription. So that means the output of your model needs to be transformed into ordinary Arabic text before the language model can correct it, right? Please correct me if I am wrong. I am also wondering what could go wrong if you just write ordinary Arabic text in your .csv files?

Do you mean there’s a discrepancy between how the acoustic model learnt spaces and how the language model is built? If my understanding is correct, I would say that you are right: the usage needs to be consistent between both.
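
For the record, the transformation is mechanical; a minimal sketch (the function name is ours), assuming the representation described above, i.e. letters separated by spaces and “_” marking word boundaries:

    def to_plain_arabic(model_output):
        # "م ر ح ب ا _ ب ك" -> "مرحبا بك": drop the per-letter spaces
        # and turn the "_" word separator back into an ordinary space.
        words = model_output.split("_")
        return " ".join("".join(w.split()) for w in words)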

Thanks lissyx for your reply! Yes, this is exactly what I meant.

@moh.ahm.abdelraheem @lissyx

Thanks for the info! Yes, that was one of the issues we came across (and one of the reasons we were getting a warning about the CTC feature length being shorter than the transcription length).

So far we have:

a. Removed diacritics from both the LM corpus and the transcripts.
b. Checked the CTC feature length against the transcription length (see the sketch after this list).
c. Removed spaces between letters.
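
The check in (b) is essentially a duration test; a sketch of what we run (assuming a 20 ms feature step, which we believe is the default; adjust to your feature settings):

    import wave

    def transcript_fits(wav_path, transcript, feature_step_ms=20):
        # CTC needs at least as many time steps as output labels
        # (strictly more when the transcript has repeated characters,
        # since repeats require a blank step in between).
        with wave.open(wav_path) as w:
            duration_ms = 1000.0 * w.getnframes() / w.getframerate()
        return int(duration_ms / feature_step_ms) >= len(transcript)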

Regarding (b), @lissyx, there was some code that checked for that, but apparently it was refactored and moved in 0.6.0? Do you have any context on why that is, and whether we can include that check when preprocessing the data (i.e. when checking the alphabet characters)?

Why?

No, there was never code for that in training itself; that check is performed at the importer level.

Regarding (a), we wanted to start simple and work our way towards diacritics. We’ll add them after we’ve found suitable training parameters with a decent workflow.
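
For what it’s worth, (a) is mechanical, since Arabic diacritics (harakat) are Unicode combining marks; a minimal sketch:

    import unicodedata

    def strip_diacritics(text):
        # Fatha, damma, kasra, tanwin, shadda, sukun, ... all have a
        # non-zero combining class, so dropping combining characters
        # removes the diacritics and keeps the base letters.
        return "".join(c for c in unicodedata.normalize("NFD", text)
                       if not unicodedata.combining(c))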

Regarding (b), would a contribution to add this check be welcome, or is that out of scope for the project? I’m not familiar with the DeepSpeech development roadmap.

This makes no sense. You modified the dataset, so your training parameters will not be the same.

Did you have any issues with some characters?

As I said, it’s already implemented in all the importers.