Hi, I’m trying to test my setup with a small Spanish dataset (80h) before I start processing the whole dataset. The problem is that it is showing a #.
Problem:
WER: 7.000000, CER: 76.000000, loss: 516.562012
- src: "iirn nh#ai ct#hihc "
- res: " t#tE f fh#it #n a #ch# niit #h fh chs# i #tcnt #hts#i #sn tsn #r # fh #iih#hts#i #lih c "
--------------------------------------------------------------------------------
WER: 3.000000, CER: 8.000000, loss: 39.043110
- src: " ih#nn#hssts"
- res: " t n #it"
--------------------------------------------------------------------------------
WER: 2.500000, CER: 9.000000, loss: 21.328753
- src: "iinhst# Ehsit"
- res: "iinh #it # ihst #it"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 3.000000, loss: 6.723901
- src: "ctsns"
- res: "ct #nn"
--------------------------------------------------------------------------------
WER: 2.000000, CER: 2.000000, loss: 8.263452
- src: "hstnhfnhst "
- res: "hstnhcn st "
--------------------------------------------------------------------------------
WER: 2.000000, CER: 4.000000, loss: 13.701299
- src: "iitlne "
- res: "it#nn n "
--------------------------------------------------------------------------------
WER: 2.000000, CER: 6.000000, loss: 15.189466
- src: "lhfh#ch#nn#a "
- res: "lhc #nni "
Here’s my alphabet.txt:
# Each line in this file represents the Unicode codepoint (UTF-8 encoded)
# associated with a numeric label.
# A line that starts with # is a comment. You can escape it with \# if you wish
# to use '#' as a label.
a
b
c
d
e
f
g
h
i
j
k
l
m
n
o
p
q
r
s
t
u
v
w
x
y
z
ü
á
é
í
ó
ú
ñ
# The last (non-comment) line needs to end with a newline.
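The rules spelled out in the comments of this file (one label per line, numeric labels assigned in file order, # for comments, \# to use # as a label) can be sketched as a small parser. This is only an illustration, not the loader DeepSpeech itself uses: `parse_alphabet` is a hypothetical name, and skipping completely empty lines is an assumption of the sketch.

```python
def parse_alphabet(text):
    """Map each label to a numeric index, following the rules in the file's comments.

    Hypothetical helper for illustration only; it is not DeepSpeech's own loader.
    """
    label_to_index = {}
    for line in text.split("\n"):
        if line.startswith("\\#"):
            label = "#"      # escaped '#': treated as a real label
        elif line.startswith("#"):
            continue         # comment line: ignored
        elif line == "":
            continue         # assumption: completely empty lines are skipped
        else:
            label = line     # a line holding a single space is kept as ' '
        if label not in label_to_index:
            label_to_index[label] = len(label_to_index)
    return label_to_index


if __name__ == "__main__":
    sample = "# a comment\n \na\nb\n\\#\n"
    for label, index in parse_alphabet(sample).items():
        print(index, repr(label))
```

Lines are not stripped, so a line containing only a space survives as the label ' ' while an empty line does not.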
util/check_characters.py is showing:
### Reading in the following transcript files: ###
### ['/data/home/neoxz/Desktop/deepspeech/data/dev/dev.csv', '/data/home/neoxz/Desktop/deepspeech/data/test/test.csv', '/data/home/neoxz/Desktop/deepspeech/data/train/train.csv'] ###
### The following unique characters were found in your transcripts: ###
['m', 'h', 'z', 'v', 'j', 'k', 't', 'o', 'a', 'ñ', 'd', 'e', 'u', 'c', 'q', 'é', 'r', 'á', 'ü', 'l', 'b', 'ó', 'x', 'i', 'f', 's', 'n', 'g', ' ', 'y', 'ú', 'w', 'í', 'p']
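One quick sanity check is to compare this character list against the labels in alphabet.txt. The sketch below does that with plain set differences. The labels are copied from the listing as it appears in this post, so a whitespace-only label (such as a line holding just a space) may not have survived pasting; treat the result as a prompt to look at the real file, not as a diagnosis. `compare_charsets` is a hypothetical helper name.

```python
# Labels copied from the alphabet.txt listing in this post.
# Assumption: a whitespace-only label may have been lost when the file was pasted.
alphabet_labels = set("abcdefghijklmnopqrstuvwxyzüáéíóúñ")

# Characters reported by util/check_characters.py in this post.
transcript_chars = set([
    'm', 'h', 'z', 'v', 'j', 'k', 't', 'o', 'a', 'ñ', 'd', 'e', 'u', 'c',
    'q', 'é', 'r', 'á', 'ü', 'l', 'b', 'ó', 'x', 'i', 'f', 's', 'n', 'g',
    ' ', 'y', 'ú', 'w', 'í', 'p',
])


def compare_charsets(alphabet, transcripts):
    """Return (characters missing from the alphabet, labels never used in transcripts)."""
    missing = sorted(transcripts - alphabet)
    unused = sorted(alphabet - transcripts)
    return missing, unused


missing, unused = compare_charsets(alphabet_labels, transcript_chars)
print("In transcripts but not in alphabet:", missing)
print("In alphabet but not in transcripts:", unused)
```

Running the same comparison against the actual alphabet.txt on disk, rather than the pasted copy, is the more reliable check.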
I was using my own LM, built from 2M sentences (including the training ones), with the default English settings to start testing it. I’ve removed the LM flags and am currently running 1 epoch to see if the LM is the problem.
I’m still getting familiar with the whole process so, let me know if I’m missing any important information about what I’m doing.
./bin/run-ldc93s1.sh works perfectly with the current setup.
To build the LM, I used the generated alphabet characters to check the LM sentences. All checks passed.
Any idea?