Hi
I’m trying to build a Portuguese DeepSpeech model. I found a Portuguese corpus in Voxforge and modified import_voxforge.py to download and process it.
But the process does not replace the special characters with the nearest ASCII character; instead it replaces them with spaces.
I understand that the issue is in:
    def _generate_dataset(data_dir, data_set):
        extracted_dir = path.join(data_dir, data_set)
        files = []
        for promts_file in glob(path.join(extracted_dir+"/*/etc/", "PROMPTS")):
            if path.isdir(path.join(promts_file[:-11],"wav")):
                with codecs.open(promts_file, 'r', 'utf-8') as f:
                    for line in f:
                        id = line.split(' ')[0].split('/')[-1]
                        sentence = ' '.join(line.split(' ')[1:])
                        # sentence = re.sub("[^a-z']"," ",sentence.strip().lower())
                        sentence = re.sub("[^a-zàâäôéèëêïîçù']"," ",sentence.strip().lower())
                        transcript = ""
                        for token in sentence.split(" "):
                            word = token.strip()
                            if word!="" and word!=" ":
                                transcript += word + " "
                        transcript = unicodedata.normalize("NFKD", transcript.strip()) \
                            .encode("ascii", "ignore") \
                            .decode("ascii", "ignore")
                        wav_file = path.join(promts_file[:-11],"wav/" + id + ".wav")
                        if gfile.Exists(wav_file):
                            wav_filesize = path.getsize(wav_file)
                            # remove audios that are shorter than 0.5s and longer than 20s.
                            # remove audios that are too short for transcript.
                            if (wav_filesize/32000)>0.5 and (wav_filesize/32000)<20 and transcript!="" and \
                                wav_filesize/len(transcript)>1400:
                                files.append((path.abspath(wav_file), wav_filesize, transcript))
where I replaced the original regex (the line that is now commented out) with
    sentence = re.sub("[^a-zàâäôéèëêïîçù']"," ",sentence.strip().lower())
I expected that
    unicodedata.normalize("NFKD", transcript.strip()) \
        .encode("ascii", "ignore") \
        .decode("ascii", "ignore")
would replace each non-ASCII character with its nearest ASCII equivalent and get rid of the rest.
That does not happen.
For example, one original sentence in the corpus (from the PROMPTS file) is:
anonymous-20121216-xhy/mfc/088 ONDE FICA A ESTAçãO DO TREM
and the corresponding line in voxforge-dev.csv is
~/dev/anonymous-20121216-xhy/wav/088.wav,124044,onde fica a esta o do trem
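One thing that may be worth checking is the order of the two steps. In the snippet above the regex filter runs before the NFKD/ASCII folding, and "ã" is not in its character class, so it would already be a space by the time the folding runs. The following is only a minimal sketch for testing the two orderings in isolation, assuming Python 3 and precomposed (NFC) characters in the PROMPTS file; the helper names `to_ascii`, `clean_filter_first` and `clean_fold_first` are hypothetical and not part of import_voxforge.py. It may not reproduce the exact CSV output above (for instance under Python 2, where a non-unicode regex literal behaves differently).

```python
import re
import unicodedata


def to_ascii(text):
    """Decompose accented characters (NFKD) and drop the non-ASCII marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


def clean_filter_first(sentence):
    """Same order as the snippet above: regex filter, then ASCII folding."""
    # Any character outside the class (e.g. "ã") becomes a space here,
    # before the folding step has a chance to map it to "a".
    filtered = re.sub("[^a-zàâäôéèëêïîçù']", " ", sentence.strip().lower())
    return to_ascii(" ".join(filtered.split()))


def clean_fold_first(sentence):
    """Alternative order: ASCII folding first, then a plain a-z filter."""
    folded = to_ascii(sentence.strip().lower())
    filtered = re.sub("[^a-z']", " ", folded)
    return " ".join(filtered.split())


# Prompt text from the example above, written with escapes so the
# accented characters are unambiguous: "ESTAçãO".
sample = "ONDE FICA A ESTA\u00e7\u00e3O DO TREM"

print(clean_filter_first(sample))  # still contains a gap inside the word
print(clean_fold_first(sample))    # onde fica a estacao do trem
```

If the second ordering gives the expected transcript on the real data, the character class in the regex (or its position in the pipeline) is the likely place to look; if not, the encoding of the PROMPTS file or the Python version may be involved.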
I searched the web and also tried unidecode, with no luck.
I would appreciate any suggestions.