Key error "\u200d"

Shruthi_Sridhar · December 1, 2019, 5:47pm

I am fine-tuning the DeepSpeech v0.5.1 pretrained model with a Hindi dataset.
I added an alphabet.txt containing the unique characters reported by util/check_characters.py, but I get the error below:

Key error “\u200d”
Your transcripts contain characters which do not occur in data/alphabet.txt! Use util/check_characters.py to see what characters are in your {train,dev,test}.csv transcripts, and then add all these to data/alphabet.txt.

\u200d is the zero-width joiner (ZWJ) and \u200c is the zero-width non-joiner (ZWNJ). Both are non-printable characters.
How do I overcome this error?

alchemi5t · December 1, 2019, 6:05pm

Just remove the ZWJ and ZWNJ characters from the transcripts. U+200D and U+200C are not used consistently and should be cleaned out. They are only used for rendering purposes and add nothing to the data other than noise.

Write a script to remove all ZWJ and ZWNJ characters. You could tweak the check_characters script. A cleaner that allows only characters in the Devanagari range should work best.
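A minimal sketch of such a cleaner, under a few assumptions: the CSVs use the standard DeepSpeech columns (wav_filename, wav_filesize, transcript), the transcripts are meant to be pure Hindi, and "Devanagari range" means the Unicode block U+0900 to U+097F (which also includes the danda and Devanagari digits). The function names and file paths are hypothetical, not part of DeepSpeech.

```python
import csv
import re

# Zero-width non-joiner (U+200C) and zero-width joiner (U+200D).
ZERO_WIDTH = re.compile("[\u200c\u200d]")

# Anything outside the Devanagari block (U+0900-U+097F) other than a space.
# Assumption: Latin letters, digits and punctuation are unwanted as well.
NOT_DEVANAGARI = re.compile("[^\u0900-\u097f ]")


def clean_transcript(text):
    """Drop ZWJ/ZWNJ, blank out non-Devanagari characters, collapse spaces."""
    # Remove the joiners first so that words are not split where they occurred.
    text = ZERO_WIDTH.sub("", text)
    text = NOT_DEVANAGARI.sub(" ", text)
    return " ".join(text.split())


def clean_csv(src_path, dst_path):
    """Copy a DeepSpeech-style CSV, cleaning the 'transcript' column."""
    with open(src_path, newline="", encoding="utf-8") as src, \
         open(dst_path, "w", newline="", encoding="utf-8") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            row["transcript"] = clean_transcript(row["transcript"])
            if row["transcript"]:  # skip rows that end up empty
                writer.writerow(row)
```

Something like clean_csv("train.csv", "train_clean.csv") would then be run once per split before training. If the transcripts intentionally mix Hindi and English, the NOT_DEVANAGARI pattern would need to be widened accordingly.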

lissyx · December 1, 2019, 6:20pm

You cannot change the alphabet when you just re-use the pretrained model to perform extra tuning. The transfer-learning2 branch might allow that, but it’s a bit outdated and not based on 0.5.1.

Shruthi_Sridhar · December 2, 2019, 4:51am

@lissyx Thanks for the reply. I have around 16 hours of Hindi data. I had added the unique Hindi characters along with the English alphabet to alphabet.txt for fine-tuning. Since you mentioned that the alphabet cannot be changed, could you kindly suggest:
1) Is it possible to fine-tune the pretrained model on a non-English dataset? If yes, could you share any helpful links?
2) Or do I have to train my model from scratch?

Shruthi_Sridhar · December 2, 2019, 4:51am

@alchemi5t Thank you for the reply. I will try to remove them from the transcripts.

lissyx · December 2, 2019, 10:09am

You won’t get anything usable from 16 hours of audio.

I just told you how …