Hello all,
We thought we’d mention that over at the Te Hiku / Kōrero Māori project we’re doing some work on the decode_metadata() methods to add a bit more info to the MetadataItem object, specifically around confidence per letter.
This work might feed into some projects around pronunciation. Another thing we might do with it is show some level of confidence in our transcription UI, to give a sense of when the model is confident of a transcription and when/where it might be worth the human reviewer taking a closer look.
We made a branch at around ‘deepspeech-0.5.0a8’ and we’re hoping that we’ll be able to turn it into a PR at some point in the future.
At this point the code is working for our experimentation but the PR is not ready. We just thought it might be good to sort of mention that we’re doing this in case it overlaps with other work already going on or coming soon.
It’s really only a few changes to the decode_metadata method in deepspeech.cc
557: ModelState::decode_metadata(const vector<float>& logits)
and to MetadataItem in deepspeech.h (where we’ve added three new properties for now):
// Stores each individual character, along with its timing and confidence information
struct MetadataItem {
  char* character;
  int timestep;        // Position of the character in units of 20ms
  float start_time;    // Position of the character in seconds
  double probability;  // Logit value at the time the character was chosen
  double entropy;      // Entropy across all logits at the time the character was chosen
  char* acoustic_char; // Best guess from acoustic model at timestep of chosen letter (sometimes differs from best guess overall)
};
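For anyone wondering what the entropy and acoustic best-guess fields might look like in practice, here is a minimal sketch of one way to compute them for a single timestep. It is not the code from our branch: `TimestepConfidence` and `summarize_timestep` are hypothetical names, and the sketch assumes the input is an already-normalised probability distribution (e.g. softmax output) and that entropy means Shannon entropy in nats.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper type, not part of DeepSpeech: a summary of one
// timestep's output distribution.
struct TimestepConfidence {
  std::size_t best_index;   // index of the most probable class at this timestep
  double best_probability;  // probability of that class
  double entropy;           // Shannon entropy of the distribution, in nats (assumption)
};

// Assumes `probs` is a non-empty, already-normalised probability
// distribution for a single timestep.
TimestepConfidence summarize_timestep(const std::vector<double>& probs) {
  assert(!probs.empty());
  TimestepConfidence out{0, probs[0], 0.0};
  for (std::size_t i = 0; i < probs.size(); ++i) {
    const double p = probs[i];
    if (p > out.best_probability) {
      out.best_index = i;
      out.best_probability = p;
    }
    if (p > 0.0) {
      out.entropy -= p * std::log(p);  // terms with p == 0 contribute nothing
    }
  }
  return out;
}
```

With this definition, a distribution that puts nearly all of its mass on one class has entropy near 0, while a uniform distribution over N classes has entropy log(N), so higher values mean the model was less sure at that timestep.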
Our current plan is to experiment with the above fields with some real world data, so we can see which of these confidence measures is actually useful, maybe tweak it a bit based on that feedback and then create a PR. So it will be a wee ways off and of course we fully expect we may have to adapt or minimize our changes even more based on feedback during the PR process.
That said, we figured we might as well get the word out there that this is something we’re working on.
In case there are replies I figure I’ll just tag my collaborators at TeHikuMedia @kmahelona and @mathematiguy and maybe I’ll tag @lissyx on this one as well, since I guess you were the person who added the timing metadata stuff in the first place.
PS Some example output… you’ll notice at 8.20 seconds the acoustic model guesses ‘n’ but the language model corrects that in the final transcription to ŋ (aka ng).
Target transcription
# Ka whakapā a Hine ki tētahi āhuatanga whakahirahira o te whakamahinga o te reo Māori
Actual transcription (this one has 0% WER)
# ka whakapā a hine ki tētahi āhuatanga whakahirahira o te whakamahinga o te reo māori
Raw output transcription (in our new Te Reo Māori specific orthography)
# ka ƒakapā a hine ki tētahi āhuataŋa ƒakahirahira o te ƒakamahiŋa o te reo māori
Raw transcription if we only used the acoustic model
# ka ƒakapā a hine ki tētahi āhuataŋa ƒakahirahira o te ƒakamahina o te reo māori
Per-character output, formatted as 'char':seconds:'acoustic_char' prob:probability entropy:entropy
'k':1.28:'k' prob:0.997742 entropy:0.025697
'a':1.48:'a' prob:0.999660 entropy:0.005033
' ':1.42:' ' prob:0.998576 entropy:0.015578
'ƒ':1.46:'ƒ' prob:0.999910 entropy:0.001490
'a':1.62:'a' prob:0.999978 entropy:0.000389
'k':1.60:'k' prob:0.999788 entropy:0.003169
'a':1.62:'a' prob:0.999978 entropy:0.000389
'p':1.84:'p' prob:0.991591 entropy:0.083503
'ā':1.86:'ā' prob:0.669923 entropy:0.946214
' ':2.22:' ' prob:0.898275 entropy:0.509989
'a':2.24:'a' prob:0.997645 entropy:0.026884
' ':2.46:' ' prob:0.636555 entropy:0.950531
'h':2.50:'h' prob:0.994121 entropy:0.061711
'i':2.52:'i' prob:0.998789 entropy:0.015438
'n':2.68:'n' prob:0.999938 entropy:0.000999
'e':2.70:'e' prob:0.998212 entropy:0.021415
' ':3.38:' ' prob:0.841081 entropy:0.633549
'k':3.08:'k' prob:0.965924 entropy:0.249148
'i':3.10:'i' prob:0.994305 entropy:0.060981
' ':3.38:' ' prob:0.841081 entropy:0.633549
't':3.66:'t' prob:0.999955 entropy:0.000785
'ē':3.44:'ē' prob:0.985221 entropy:0.119456
't':3.66:'t' prob:0.999955 entropy:0.000785
'a':3.68:'a' prob:0.999769 entropy:0.003429
'h':3.80:'h' prob:0.999525 entropy:0.005986
'i':3.82:'i' prob:0.999887 entropy:0.001707
' ':4.06:' ' prob:0.999827 entropy:0.002440
'ā':4.12:'ā' prob:0.999900 entropy:0.001572
'h':4.38:'h' prob:0.999580 entropy:0.005927
'u':4.40:'u' prob:0.999936 entropy:0.001080
'a':4.66:'a' prob:0.998985 entropy:0.012388
't':4.64:'t' prob:0.999953 entropy:0.000757
'a':4.82:'a' prob:0.999497 entropy:0.006564
'ŋ':4.80:'ŋ' prob:0.999840 entropy:0.002288
'a':4.82:'a' prob:0.999497 entropy:0.006564
' ':5.66:' ' prob:0.994251 entropy:0.053260
'ƒ':5.70:'ƒ' prob:0.994764 entropy:0.057157
'a':5.72:'a' prob:0.998290 entropy:0.020614
'k':5.86:'k' prob:0.995913 entropy:0.039462
'a':5.88:'a' prob:0.995949 entropy:0.040058
'h':6.08:'h' prob:0.996310 entropy:0.038118
'i':6.10:'i' prob:0.997903 entropy:0.025600
'r':6.20:'r' prob:0.999943 entropy:0.000914
'a':6.22:'a' prob:0.999786 entropy:0.003344
'h':6.36:'h' prob:0.998314 entropy:0.018087
'i':6.38:'i' prob:0.994112 entropy:0.059620
'r':6.52:'r' prob:0.999738 entropy:0.003533
'a':6.54:'a' prob:0.999723 entropy:0.003865
' ':6.86:' ' prob:0.999614 entropy:0.004946
'o':6.92:'o' prob:0.992238 entropy:0.078715
' ':7.06:' ' prob:0.999666 entropy:0.004346
't':7.08:'t' prob:0.999473 entropy:0.006716
'e':7.10:'e' prob:0.999567 entropy:0.005573
' ':7.30:' ' prob:0.999920 entropy:0.001255
'ƒ':7.34:'ƒ' prob:0.999175 entropy:0.010537
'a':7.36:'a' prob:0.999082 entropy:0.011338
'k':7.50:'k' prob:0.999428 entropy:0.007066
'a':7.80:'a' prob:0.999715 entropy:0.003959
'm':7.78:'m' prob:0.999868 entropy:0.002160
'a':7.80:'a' prob:0.999715 entropy:0.003959
'h':8.02:'h' prob:0.994506 entropy:0.051605
'i':8.04:'i' prob:0.997908 entropy:0.024319
'ŋ':8.20:'n' prob:0.163235 entropy:0.758166
'a':8.22:'a' prob:0.986215 entropy:0.106885
' ':8.84:' ' prob:0.999548 entropy:0.005679
'o':8.90:'o' prob:0.996885 entropy:0.037253
' ':9.08:' ' prob:0.999404 entropy:0.007245
't':9.10:'t' prob:0.999635 entropy:0.004874
'e':9.12:'e' prob:0.999580 entropy:0.005448
' ':9.20:' ' prob:0.999891 entropy:0.001612
'r':9.24:'r' prob:0.999954 entropy:0.000807
'e':9.26:'e' prob:0.999885 entropy:0.001855
'o':9.38:'o' prob:0.999943 entropy:0.000926
' ':9.60:' ' prob:0.999293 entropy:0.008496
'm':9.64:'m' prob:0.999928 entropy:0.001178
'ā':9.66:'ā' prob:0.999976 entropy:0.000439
'o':9.84:'o' prob:0.999944 entropy:0.000880
'r':9.94:'r' prob:0.998312 entropy:0.019489
'i':9.96:'i' prob:0.999996 entropy:0.000085
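As a rough illustration of how output like the above could drive the transcription UI idea, the sketch below flags characters a reviewer may want to look at: those whose probability falls under a threshold, or where the final character differs from the acoustic best guess (as with ŋ vs 'n' at 8.20 seconds). `CharConfidence` and `flag_for_review` are hypothetical names, the struct is a simplified stand-in for the extended MetadataItem, and any threshold value is something we would still need to tune on real data.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Simplified, hypothetical stand-in for the extended MetadataItem.
// std::string is used so multi-byte UTF-8 characters compare correctly.
struct CharConfidence {
  std::string character;      // character in the final transcription
  std::string acoustic_char;  // best guess from the acoustic model alone
  float start_time;           // seconds
  double probability;
  double entropy;
};

// Returns the indices of items that may deserve a closer look: either the
// probability is below `min_probability`, or the final character differs
// from the acoustic model's best guess.
std::vector<std::size_t> flag_for_review(const std::vector<CharConfidence>& items,
                                         double min_probability) {
  assert(min_probability >= 0.0 && min_probability <= 1.0);
  std::vector<std::size_t> flagged;
  for (std::size_t i = 0; i < items.size(); ++i) {
    const bool low_confidence = items[i].probability < min_probability;
    const bool overridden = items[i].character != items[i].acoustic_char;
    if (low_confidence || overridden) {
      flagged.push_back(i);
    }
  }
  return flagged;
}
```

Run over the sample above with an illustrative threshold of 0.9, this would pick out the ŋ at 8.20 seconds along with a handful of lower-probability characters such as the 'ā' at 1.86 seconds and some of the spaces.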