Train French Common Voice dataset

Hi,

I'm trying to train on the French Common Voice dataset, but the results are very bad.

My command:
./DeepSpeech.py --dev_files /data/clips/dev.csv --test_files /data/clips/test.csv --train_files /data/clips/train.csv --train_batch_size 80 --dev_batch_size 80 --test_batch_size 40 --n_hidden 375 --epoch 100 --dropout_rate 0.22 --learning_rate 0.00095 --report_count 100 --use_seq_length False --checkpoint_dir /data/checkpoints --export_dir /data/models --alphabet_config_path /data/alphabet.txt 2>&1 | tee output.log

Results:
WER: 1.000000, CER: 23.000000, loss: 2.552169

  • src: “je le retire monsieur le président”
  • res: “jelâreziâierezi”

WER: 1.000000, CER: 7.000000, loss: 2.598004

  • src: “quel aveu”
  • res: “lee”

WER: 1.000000, CER: 70.000000, loss: 2.705939

  • src: “je suis saisi de deux amendements identiques numéros deux cent quarantecinq et trois cent soixantecinq”
  • res: “jeiâieeâeneneâiqerâenqârâneineinziâneinq”

WER: 1.000000, CER: 8.000000, loss: 2.796613

  • src: “rouge vif”
  • res: “i”

WER: 1.000000, CER: 11.000000, loss: 2.822357

  • src: “quatre grands”
  • res: “ree”

WER: 1.000000, CER: 11.000000, loss: 2.837523

  • src: “ça vous plait”
  • res: “all”

WER: 1.000000, CER: 14.000000, loss: 2.878527

  • src: “vous l’avez remarqué”
  • res: “nlâzezeârq”
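For context, the CER values in the report above are raw character-level edit distances and WER is the word-level distance normalized by reference length, so WER 1.000000 means no reference word was recognized. A minimal sketch of how such metrics can be computed (not DeepSpeech's actual implementation):

```python
def edit_distance(ref, hyp):
    # Classic Levenshtein dynamic programming over two sequences
    # (strings for CER, word lists for WER).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]

def wer(ref, hyp):
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

print(edit_distance("quel aveu", "lee"))  # 7, the CER count reported above
print(wer("quel aveu", "lee"))            # 1.0: no reference word matches
```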

One more thing: how do I create a vocabulary.txt from this kind of dataset?

Thanks.
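For what it's worth, since the importer's CSVs use the `wav_filename,wav_filesize,transcript` layout, one way to produce a vocabulary.txt (one sentence per line, as KenLM expects) might be the following sketch. This is not the official tooling, and the in-memory demo CSV just stands in for `/data/clips/train.csv`:

```python
import csv
import io

def build_vocabulary(csv_files, out):
    """Write one unique, lower-cased transcript per line."""
    seen = set()
    for f in csv_files:
        for row in csv.DictReader(f):
            text = row["transcript"].strip().lower()
            if text and text not in seen:
                seen.add(text)
                out.write(text + "\n")

# Tiny in-memory demo; with real data, pass open("/data/clips/train.csv")
# and write to /data/vocabulary.txt instead.
demo = io.StringIO(
    "wav_filename,wav_filesize,transcript\n"
    "a.wav,100,quel aveu\n"
    "b.wav,120,rouge vif\n"
)
out = io.StringIO()
build_vocabulary([demo], out)
print(out.getvalue())
```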

I would not expect good results with those settings. If you are interested in French, please join https://github.com/Common-Voice/commonvoice-fr and https://github.com/mozfr/besogne/wiki/Common-Voice-Fr

Thank you, but how do I calculate the appropriate value?

I'm now using the same dimensions as the released English model, 2048.

Same result with ./DeepSpeech.py --dev_files /data/clips/dev.csv --test_files /data/clips/test.csv --train_files /data/clips/train.csv --train_batch_size 80 --dev_batch_size 40 --test_batch_size 40 --n_hidden 2048 --epoch 100 --dropout_rate 0.30 --learning_rate 0.0001 --report_count 100 --use_seq_length False --checkpoint_dir /data/checkpoints --export_dir /data/models --alphabet_config_path /data/alphabet.txt 2>&1 | tee output.log

It depends so much on a lot of details that you don’t share. Please join the existing efforts to produce a model …

I use the French Common Voice dataset (74 hours), which I previously transformed with import_cv2.py. My alphabet is:

### Reading in the following transcript files: ###
### ['/data/clips/train.csv'] ###
### The following unique characters were found in your transcripts: 
###
â
z
'
!  
œ
ë
í
g
q
=
ê
n
l
°
ñ
)
a
r 
î
i
ç
e
—
ù
j
y
ï
á
…
½
«
û
w
;
p
’
é
/
ô
ö
ÿ
à

d
:
x
h
u
b
k
ü
º
»
–
s
è
v
m
c
o
t
f
### ^^^ You can copy-paste these into data/alphabet.txt ###

My container:

# git-lfs repository
RUN curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash

# Dependencies 
RUN apt-get update && \
    apt-get install -y \
        sox \
        libsox-fmt-mp3 \
        git-lfs \
        libboost-all-dev \
        cmake \
        zlib1g-dev \
        libbz2-dev \
        liblzma-dev
        
RUN git clone https://github.com/mozilla/DeepSpeech.git /var/lib/deepspeech

WORKDIR /var/lib/deepspeech

# Install kenlm
RUN wget -O - https://kheafield.com/code/kenlm.tar.gz |tar xz && \
    mkdir kenlm/build && \
    cd kenlm/build && \
    cmake .. && \
    make -j 4

RUN git lfs install && \
    pip3 install -r requirements.txt && \
    python3 util/taskcluster.py --target . && \
    pip3 install $(python3 util/taskcluster.py --decoder)
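Once the image is built, the KenLM binaries compiled above can turn a vocabulary.txt (one sentence per line) into a language model for the decoder. A rough sketch, where the n-gram order and the paths are my assumptions:

```shell
# Build a 3-gram ARPA model from the sentence list, then binarize it.
kenlm/build/bin/lmplz --order 3 --text /data/vocabulary.txt --arpa /data/lm.arpa
kenlm/build/bin/build_binary /data/lm.arpa /data/lm.binary
```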

VOLUME ["/data"]

Well, that's not super surprising: you are on an old version of the dataset, and there is work we need to do to improve its quality. Again, please join the efforts I linked above; there's no point in everyone re-doing the same work and hitting the same issues again and again. Efforts need to be shared.

So this dataset is unusable? What a pity!

Thank you for your help lissyx.

Please be mindful: I never said it was "unusable". We are just at the beginning, so some work is needed. Common Voice is intended not only for DeepSpeech, so there is cleanup we cannot do on Common Voice itself but that needs to be done when training with DeepSpeech.
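To illustrate the kind of cleanup meant here: the alphabet dump earlier in the thread contains punctuation and symbols (!, =, °, «, », …) that inflate the output alphabet and could be stripped from transcripts before training. A hedged sketch; the allowed character set below is my assumption, not an official list:

```python
import unicodedata

# Hypothetical whitelist: lower-case letters, apostrophe, space, and the
# French accented characters we actually want the model to emit.
ALLOWED = set("abcdefghijklmnopqrstuvwxyz' àâçéèêëîïôöùûüœ")

def clean_transcript(text):
    # Normalize to composed form, unify curly apostrophes, lower-case,
    # then drop anything outside the allowed alphabet and squeeze spaces.
    text = unicodedata.normalize("NFC", text).replace("\u2019", "'").lower()
    return " ".join("".join(c for c in text if c in ALLOWED).split())

print(clean_transcript("Vous l’avez remarqué !"))  # vous l'avez remarqué
```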