Train French Common Voice dataset

Hi,

I'm trying to train on the French Common Voice dataset, but the results are very bad.

My command:
./DeepSpeech.py --dev_files /data/clips/dev.csv --test_files /data/clips/test.csv --train_files /data/clips/train.csv --train_batch_size 80 --dev_batch_size 80 --test_batch_size 40 --n_hidden 375 --epoch 100 --dropout_rate 0.22 --learning_rate 0.00095 --report_count 100 --use_seq_length False --checkpoint_dir /data/checkpoints --export_dir /data/models --alphabet_config_path /data/alphabet.txt 2>&1 | tee output.log

Results:
WER: 1.000000, CER: 23.000000, loss: 2.552169

  • src: “je le retire monsieur le président”
  • res: “jelâreziâierezi”

WER: 1.000000, CER: 7.000000, loss: 2.598004

  • src: “quel aveu”
  • res: “lee”

WER: 1.000000, CER: 70.000000, loss: 2.705939

  • src: “je suis saisi de deux amendements identiques numéros deux cent quarantecinq et trois cent soixantecinq”
  • res: “jeiâieeâeneneâiqerâenqârâneineinziâneinq”

WER: 1.000000, CER: 8.000000, loss: 2.796613

  • src: “rouge vif”
  • res: “i”

WER: 1.000000, CER: 11.000000, loss: 2.822357

  • src: “quatre grands”
  • res: “ree”

WER: 1.000000, CER: 11.000000, loss: 2.837523

  • src: “ça vous plait”
  • res: “all”

WER: 1.000000, CER: 14.000000, loss: 2.878527

  • src: “vous l’avez remarqué”
  • res: “nlâzezeârq”
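For context, the CER values in the report above are raw character-level edit distances and WER is the word-level distance normalized by reference length, so WER 1.000000 means no reference word was recognized. A minimal sketch of how such metrics can be computed (not DeepSpeech's actual implementation):

```python
def edit_distance(ref, hyp):
    # Classic Levenshtein dynamic programming over two sequences
    # (strings for CER, word lists for WER).
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution
        prev = cur
    return prev[-1]

def wer(ref, hyp):
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / len(ref_words)

print(edit_distance("quel aveu", "lee"))  # 7, the CER count reported above
print(wer("quel aveu", "lee"))            # 1.0: no reference word matches
```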

One more thing: how do I create a vocabulary.txt from this kind of dataset?

Thanks.
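For what it's worth, since the importer's CSVs use the `wav_filename,wav_filesize,transcript` layout, one way to produce a vocabulary.txt (one sentence per line, as KenLM expects) might be the following sketch. This is not the official tooling, and the in-memory demo CSV just stands in for `/data/clips/train.csv`:

```python
import csv
import io

def build_vocabulary(csv_files, out):
    """Write one unique, lower-cased transcript per line."""
    seen = set()
    for f in csv_files:
        for row in csv.DictReader(f):
            text = row["transcript"].strip().lower()
            if text and text not in seen:
                seen.add(text)
                out.write(text + "\n")

# Tiny in-memory demo; with real data, pass open("/data/clips/train.csv")
# and write to /data/vocabulary.txt instead.
demo = io.StringIO(
    "wav_filename,wav_filesize,transcript\n"
    "a.wav,100,quel aveu\n"
    "b.wav,120,rouge vif\n"
)
out = io.StringIO()
build_vocabulary([demo], out)
print(out.getvalue())
```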

I would not expect good results with those settings. If you are interested in French, please join https://github.com/Common-Voice/commonvoice-fr and https://github.com/mozfr/besogne/wiki/Common-Voice-Fr

Thank you, but how do I calculate the appropriate value?

I'm now using the same dimensions as the released English model, 2048.

Same result with ./DeepSpeech.py --dev_files /data/clips/dev.csv --test_files /data/clips/test.csv --train_files /data/clips/train.csv --train_batch_size 80 --dev_batch_size 40 --test_batch_size 40 --n_hidden 2048 --epoch 100 --dropout_rate 0.30 --learning_rate 0.0001 --report_count 100 --use_seq_length False --checkpoint_dir /data/checkpoints --export_dir /data/models --alphabet_config_path /data/alphabet.txt 2>&1 | tee output.log

It depends so much on a lot of details that you don’t share. Please join the existing efforts to produce a model …

I use the French Common Voice dataset (74 hours), which I previously transformed with import_cv2.py. My alphabet is:

### Reading in the following transcript files: ###
### ['/data/clips/train.csv'] ###
### The following unique characters were found in your transcripts: 
###
â
z
'
!  
œ
ë
í
g
q
=
ê
n
l
°
ñ
)
a
r 
î
i
ç
e
—
ù
j
y
ï
á
…
½
«
û
w
;
p
’
é
/
ô
ö
ÿ
à

d
:
x
h
u
b
k
ü
º
»
–
s
è
v
m
c
o
t
f
### ^^^ You can copy-paste these into data/alphabet.txt ###

My container:

# git-lfs repository
RUN curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash

# Dependencies 
RUN apt-get update && \
    apt-get install -y \
        sox \
        libsox-fmt-mp3 \
        git-lfs \
        libboost-all-dev \
        cmake \
        zlib1g-dev \
        libbz2-dev \
        liblzma-dev
        
RUN git clone https://github.com/mozilla/DeepSpeech.git /var/lib/deepspeech

WORKDIR /var/lib/deepspeech

# Install kenlm
RUN wget -O - https://kheafield.com/code/kenlm.tar.gz |tar xz && \
    mkdir kenlm/build && \
    cd kenlm/build && \
    cmake .. && \
    make -j 4

RUN git lfs install && \
    pip3 install -r requirements.txt && \
    python3 util/taskcluster.py --target . && \
    pip3 install $(python3 util/taskcluster.py --decoder)
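Once the image is built, the KenLM binaries compiled above can turn a vocabulary.txt (one sentence per line) into a language model for the decoder. A rough sketch, where the n-gram order and the paths are my assumptions:

```shell
# Build a 3-gram ARPA model from the sentence list, then binarize it.
kenlm/build/bin/lmplz --order 3 --text /data/vocabulary.txt --arpa /data/lm.arpa
kenlm/build/bin/build_binary /data/lm.arpa /data/lm.binary
```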

VOLUME ["/data"]

Well, that's not super surprising: you are on an old version of the dataset, and there is work we need to do to improve its quality. Again, please join the efforts I linked above; there's no point in everyone re-doing the same work and hitting the same issues again and again. Efforts need to be shared.

So this dataset is unusable? What a pity!

Thank you for your help lissyx.

Please be mindful: I never said it was "unusable". We are just at the beginning, so some work is needed. Common Voice is intended not only for DeepSpeech, so there is cleanup we cannot do on Common Voice itself but that needs to be done when training with DeepSpeech.
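To illustrate the kind of cleanup meant here: the alphabet dump earlier in the thread contains punctuation and symbols (!, =, °, «, », …) that inflate the output alphabet and could be stripped from transcripts before training. A hedged sketch; the allowed character set below is my assumption, not an official list:

```python
import unicodedata

# Hypothetical whitelist: lower-case letters, apostrophe, space, and the
# French accented characters we actually want the model to emit.
ALLOWED = set("abcdefghijklmnopqrstuvwxyz' àâçéèêëîïôöùûüœ")

def clean_transcript(text):
    # Normalize to composed form, unify curly apostrophes, lower-case,
    # then drop anything outside the allowed alphabet and squeeze spaces.
    text = unicodedata.normalize("NFC", text).replace("\u2019", "'").lower()
    return " ".join("".join(c for c in text if c in ALLOWED).split())

print(clean_transcript("Vous l’avez remarqué !"))  # vous l'avez remarqué
```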