I know this is also a TensorFlow question, but TensorFlow's documentation on this is poor.
This is what I understand from some external documentation.
In general, this layer is capable of:
(1) applying a softmax to convert the output into a probability distribution over symbols;
(2) collapsing repeated characters between the blank symbols that come from the acoustic model;
(3) running beam search over a prefix (character) tree and extracting the most probable sequence;
(4) optionally feeding this output into a language model that is responsible for "correcting" it at the word level, based on known word sequences.
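To make step (2) concrete, here is a minimal sketch of the CTC collapse rule in plain Python (not TensorFlow's implementation): merge adjacent repeated symbols, then drop blanks. Note that repeats separated by a blank survive, which is how CTC can emit double letters. The blank id and the symbol mapping below are illustrative assumptions; TensorFlow's CTC ops actually use `num_classes - 1` as the blank index.

```python
def ctc_collapse(path, blank=0):
    """Collapse a raw CTC path: merge adjacent repeats, then drop blanks.

    `path` is a per-frame sequence of symbol ids, `blank` is the blank id
    (assumed 0 here for readability).
    """
    out = []
    prev = None
    for sym in path:
        if sym != prev and sym != blank:
            out.append(sym)
        prev = sym
    return out

# With a=1, ..., h=8, l=12, o=15 and blank=0, the raw path
# "h h e e - l l - l o o" collapses to "h e l l o":
# the blank between the two l-runs keeps them from being merged.
print(ctc_collapse([8, 8, 5, 5, 0, 12, 12, 0, 12, 15, 15]))  # -> [8, 5, 12, 12, 15]
```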
If the above is correct, my questions are:
(1) Where is the prefix tree that tf.nn.ctc_beam_search_decoder includes? Or does it not include one?
(2) Is the language model indeed used only at the word level? Is it possible to use it at the character level?
(3) I removed tf.nn.ctc_beam_search_decoder from the protobuf file and implemented it later in the pipeline; however, the results from tf.nn.ctc_beam_search_decoder inside and outside the protobuf are different. Why is that happening?
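One thing worth checking for (3) is that both call sites use the same decoder settings (e.g. `beam_width`, `top_paths`, and, in the TF1 API, `merge_repeated`), since differing defaults can change the collapsed output. As a settings-free baseline for comparing the two pipelines, a greedy best-path decode (per-frame argmax followed by CTC collapse) can be sketched in plain Python; the shapes and blank index below are illustrative assumptions, with the blank as the last class, as in TensorFlow's CTC ops:

```python
def greedy_ctc_decode(logits, blank):
    """Best-path decode: take the argmax symbol per frame, then CTC-collapse
    (merge adjacent repeats, drop blanks). `logits` is a list of per-frame
    score vectors; `blank` is the blank class index.
    """
    path = [max(range(len(frame)), key=frame.__getitem__) for frame in logits]
    decoded, prev = [], None
    for sym in path:
        if sym != prev and sym != blank:
            decoded.append(sym)
        prev = sym
    return decoded

# 3 frames, 3 classes, blank = 2 (the last class)
logits = [[0.1, 0.8, 0.1],
          [0.7, 0.2, 0.1],
          [0.1, 0.1, 0.8]]
print(greedy_ctc_decode(logits, blank=2))  # -> [1, 0]
```

If this baseline already disagrees between the two pipelines, the discrepancy is in the inputs (logits or sequence lengths) rather than in the beam search itself.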
(4) Is the language model necessary during training? Isn't the purpose to run the LM on top of the acoustic model as a standalone module?
Thanks in advance !