Thanks for sharing such a wonderful article … but could you please share a snapshot of your csv? I am confused about whether we need to give the full path of the wav files or only their names.
@gr8nishan,
Thanks for the compliments.
Here is a sample of a typical DeepSpeech csv file:
wav_filename,wav_filesize,transcript
/home/nvidia/DeepSpeech/data/alfred/dev/record.1.wav,87404,qui es-tu et qui est-il
/home/nvidia/DeepSpeech/data/alfred/dev/record.2.wav,101804,quel est ton nom ou comment tu t'appelles
/home/nvidia/DeepSpeech/data/alfred/dev/record.3.wav,65324,est-ce que tu vas bien
You must keep the first line exactly as shown (it is needed to create the columns for CSV usage).
Each following line gives 3 values, separated by commas:
- the path to the wav file (I use the absolute path; perhaps a relative path could work?!)
- its size in bytes (you can get it with os.path.getsize("the wav file"))
- the transcript (in the language of the wav)
Take a look at …DeepSpeech/bin/import_ldc93s1.py, L23 for CSV creation !!
About transcript, pay attention to only enter characters present in alphabet.txt, otherwise you’ll encounter errors when training.
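To illustrate the three columns, here is a minimal sketch (my own, not from the DeepSpeech repo) that builds such a csv with Python 3's csv module; the function name is hypothetical:

```python
import csv
import os

def write_deepspeech_csv(entries, csv_path):
    """entries: list of (wav_path, transcript) pairs.
    Writes the three-column csv that DeepSpeech expects."""
    with open(csv_path, "w", newline="") as f:
        writer = csv.writer(f)
        # the header line is mandatory
        writer.writerow(["wav_filename", "wav_filesize", "transcript"])
        for wav_path, transcript in entries:
            wav_path = os.path.abspath(wav_path)  # absolute paths are safest
            size = os.path.getsize(wav_path)      # size in bytes
            writer.writerow([wav_path, size, transcript])
```

Call it with your own list of (path, transcript) pairs and the csv path you want.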
Hope it will help you.
Vincent
@elpimous_robot
But I have more than 16000 wav files. How can I write the csv file?
We can follow the same DeepSpeech/bin/import_ldc93s1.py to write the csv file, right?
Thanks for the help. When I was trying a relative path it was not working for me, but giving the full absolute path worked.
@gr8nishan, thanks for info !
@phanthanhlong7695, try this :
Save it in a python file, run it as python2, and follow the prompts!! You'll have a nice finished CSV file!
If python3, you'll have some minor changes to make!
When asked for the prefix, enter only the wav prefix (everything before the numbers).
Ex : audio223 -> audio ; audio.223 -> audio.
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import sys
import os
import fnmatch

print('\n\n°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°° ')
print('                        CSV creator :                          ')
print('                        -------------                          ')
print('                  - adding CSV columns,                        ')
print('      - files location, bytes size, and transcription.         ')
print('              Vincent FOUCAULT, Septembre 2017                 ')
print('°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°°\n\n')

def process():
    # ask for the wav directory, the transcript file and the output CSV
    directory = raw_input('Paste here the location of your wavs:\n>> ')
    directory = directory.replace('file://', '')
    textfile = raw_input('Paste here the location of your transcript text:\n>> ')
    textfile = textfile.replace('file://', '')
    sentenceTextFile = open(textfile, 'rb')
    sentences = sentenceTextFile.readlines()
    csv_file = raw_input('Paste here the complete CSV file link:\n>> ')
    csv_file = csv_file.replace('file://', '')
    transcriptions = open(csv_file, 'wb')
    wavDir = directory
    wav_prefix = raw_input('Enter the prefix of wav file (ex : if record.223.wav --> enter "record.") :\n>> ')
    wavs = directory + "/" + wav_prefix
    print('\n******************************************************************************************')
    print('your wav dir is : ' + directory)
    print('wave prefix name is : ' + wav_prefix)
    print('transcript is here : ' + textfile)
    print('you want to save CSV here : ' + csv_file)
    print('******************************************************************************************')
    # count the wav files, then write one CSV line per file
    content = len(fnmatch.filter(os.listdir(wavDir), '*.wav'))
    print('\nNumber of wav found : ' + str(content) + '\n')
    transcriptions.write('wav_filename,wav_filesize,transcript\n')  # mandatory header
    for i in range(content):
        wavPath = wavs + str(i + 1) + '.wav'
        wavSize = os.path.getsize(wavPath)
        transcript = sentences[i]  # readlines() keeps the trailing newline
        transcriptions.write(wavPath + "," + str(wavSize) + ',' + transcript)
    transcriptions.close()

if __name__ == "__main__":
    try:
        process()
        print('---> CSV passed !')
        print('\n\n ---> Bye !!\n\n')
    except:
        print('An error occurred !! Check your links.')
        print('GOOD LUCK !!')
Here is the terminal result :
your wav dir is : /media/nvidia/neo_backup/DeepSpeech/data/alfred/test2/
wave prefix name is : record.
transcript is here : /media/nvidia/neo_backup/DeepSpeech/data/alfred/text2/test.txt
you want to save CSV here : /media/nvidia/neo_backup/DeepSpeech/data/alfred/text2/test_final.csv
Number of wav found : 71
---> CSV passed !
---> Bye !!
Hi Mark,
I ran into the same problem as this. Were you able to find a solution to this??
Prafful’s MacBook Pro:~ naveen$ /Users/naveen/Downloads/kenlm/build/bin/build_binary -T -s /Users/naveen/Downloads/kenlm/build/words.arpa lm.binary
Reading /Users/naveen/Downloads/kenlm/build/words.arpa
----5---10---15---20---25---30---35---40---45---50---55---60---65---70---75---80---85---90---95--100
/Users/naveen/Downloads/kenlm/lm/vocab.cc:305 in void lm::ngram::MissingSentenceMarker(const lm::ngram::Config &, const char *) threw SpecialWordMissingException.
The ARPA file is missing <s> and the model is configured to reject these models. Run build_binary -s to disable this check. Byte: 191298
ERROR
How did you build your arpa?
/bin/bin/./lmplz --text vocabulary.txt --arpa words.arpa --o 3
Hi!
I have quite a vague understanding of what caused that error in my case. I think it was something related to wrong characters or wrong encoding. But I fixed the problem by filtering out of the vocabulary all characters that are not present in my alphabet.
In Python something like that:
PERMITTED_CHARS = "1234567890abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ "
new_data = "".join(c for c in data if c in PERMITTED_CHARS)
I am trying this process on macOS. I have got everything done except the trie file. When I try to generate the trie file, I get this error using the details provided:
"cannot execute binary file"
When I searched this error, I saw that it's a Linux binary. Is that so?
Can anyone help me out?
btw, this is what i am running:
/Users/naveen/generate_trie / /Users/naveen/Downloads/DeepSpeech/alphabet.txt / /Users/naveen/Downloads/DeepSpeech/lm.binary / /Users/naveen/Downloads/DeepSpeech/vocabulary.txt / /Users/naveen/Downloads/DeepSpeech/trie
Yup, like this only. Finally, this got resolved when I did "Run build_binary -s to disable this check." as suggested.
Hey, thank you for the tutorial , it’s really helpful.
I have been trying to train a french model using this data. https://datashare.is.ed.ac.uk/handle/10283/2353
I divided the data into 6800 files for training, 1950 for dev, 976 for test.
I followed all your steps, but the loss is really high and it doesn't decrease much; it doesn't go below 160, and if I enabled early stopping it would stop at 46 epochs.
any thoughts ?
I think the problem was with the frequency of the files. they were in (41000 Hz) and i converted them to (16000 Hz) and it works better now.
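To double-check the sample rate before training, here is a minimal sketch using Python's wave module (the file name in the comment is just an example):

```python
import wave

def sample_rate(wav_path):
    """Return the sample rate (Hz) of a wav file."""
    with wave.open(wav_path, "rb") as w:
        return w.getframerate()

# e.g. check that sample_rate("record.1.wav") == 16000 before training
```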
Very good…
And the wavs must be correctly sampled:
Ex: open a test wav in Audacity / it should reach ±0.5 amplitude…
The closer to the max (±0.5), the better for training.
Ps: what is your total wav duration for French??
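To check that amplitude without opening Audacity, here is a rough sketch of mine for 16-bit PCM wavs (it assumes signed 16-bit samples and a little-endian machine; it is not from the original tutorial):

```python
import wave
import array

def peak_amplitude(wav_path):
    """Peak amplitude of a 16-bit PCM wav, normalised to 0..1
    (so 0.5 matches the +/-0.5 scale shown in Audacity)."""
    with wave.open(wav_path, "rb") as w:
        assert w.getsampwidth() == 2, "expects 16-bit samples"
        # 'h' = signed 16-bit; assumes a little-endian machine
        samples = array.array("h", w.readframes(w.getnframes()))
    if not samples:
        return 0.0
    return max(abs(s) for s in samples) / 32768.0
```

A recording peaking well below 0.5 is probably too quiet for good training.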
It's about ten hours. I'm facing another problem: the ten hours are all the same female voice. When I tried to use recordings of a different male speaker, it didn't work. Is the model sensitive to the voice itself?
No. The computer doesn't mind!!
It should be a wav format error, or some alphabet changes (or csv).
Maybe I wasn't so clear: I trained with the female voice only, and tried to test with a male voice and a different tone, but it didn't give a good output (random text).
Ah… not the same!!!
Normal.
The model only knows this girl's voice!!
This is why we need a maximum of different speakers, to let the model try to understand an unknown one (the principle of this deep learning!)
Hope this will help.
Yes, thank you. I will try to get more data and different speakers. Thank you again.
Do it right… perhaps I'll ask you to test your model!! LOL
I don't see the total audio duration at your link!!
Do you know the total audio duration on the website, for French?
Hi there,
Just for testing, I have only one sentence in my vocabulary file (vocabulary.txt), and I use the kenlm tool to generate the arpa file. But it's taking too long to generate the arpa file. Is that usual?
Here's my command line in the kenlm/build directory:
(sr_env) jugs@jugs:~/PycharmProjects/DeepSpeech/native_client/kenlm/build$ bin/lmplz -o 5 ~/Desktop/jugs_lm/vocabulary.txt ~/Desktop/jugs_lm/out.arpa
and the running process shows,
=== 1/5 Counting and sorting n-grams ===
File /dev/pts/23 isn’t normal. Using slower read() instead of mmap(). No progress bar.