Hi there,
TL;DR: We want to run the evaluate.py script multiple times, setting different values for the language model hyperparameters lm_alpha and lm_beta. Since we're using Python to do the driving, we want to run evaluate, switch the tf.FLAGS as needed, and run it again.
It's a simplification, but picture we're doing something like this:
# Build the inference graph once
graph = create_inference_graph(batch_size=FLAGS.test_batch_size, n_steps=-1)

# First evaluation with lm_alpha = 0.1
FLAGS.lm_alpha = 0.1
samples = evaluate(test_data, graph)

# Second evaluation, reusing the same graph, with lm_alpha = 4
FLAGS.lm_alpha = 4
samples = evaluate(test_data, graph)
Just doing this, we get the following error:
F tensorflow/core/framework/tensor.cc:793] Check failed: limit <= dim0_size (1058 vs. 1057)
which I think is because we’re trying to run a computation graph that we have already initialised and run. I’m not really sure what to do with this error.
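In case it helps frame an answer: one thing I plan to try, assuming the problem really is re-running an already-built graph, is to throw the graph away and rebuild it for every hyperparameter value. This is only a sketch, using the same create_inference_graph/evaluate/test_data names as above and tf.reset_default_graph() from TF 1.x:

import tensorflow as tf

def evaluate_with_alpha(lm_alpha, test_data):
    # Start from a clean slate so the previously-built graph can't interfere
    tf.reset_default_graph()
    # Set the flag before building/running, in case graph construction reads it
    FLAGS.lm_alpha = lm_alpha
    graph = create_inference_graph(batch_size=FLAGS.test_batch_size, n_steps=-1)
    return evaluate(test_data, graph)

samples_low = evaluate_with_alpha(0.1, test_data)
samples_high = evaluate_with_alpha(4.0, test_data)

Rebuilding the graph each time is obviously slower than reusing it, which is why I'd prefer a cleaner answer.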
Failing this, the other way to do it is to use Python to drive evaluate.py entirely from the command line. This is fine, but it requires a lot of file manipulation and a temp directory, so it wasn't the first thing I considered trying.
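For reference, the command-line version would look roughly like the sketch below. The flag names (--lm_alpha, --lm_beta) and the bare invocation are placeholders for whatever evaluate.py actually expects, and parsing the word error rate out of the output is left out:

import subprocess

def run_evaluate(lm_alpha, lm_beta):
    # Placeholder command line; add whatever dataset/checkpoint flags evaluate.py needs
    cmd = [
        "python", "evaluate.py",
        "--lm_alpha", str(lm_alpha),
        "--lm_beta", str(lm_beta),
    ]
    result = subprocess.run(cmd, stdout=subprocess.PIPE, universal_newlines=True, check=True)
    return result.stdout  # still needs the WER parsed out of it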
Anyway, I hope someone has something useful to suggest.
Added context:
In our experiments, manual hyperparameter optimisation of lm_alpha and lm_beta made a difference of about 2-3% to our validation error, so I'm working towards automating the process.
I’ve put together a fairly straightforward strategy following some experiments in which I learned the following:
- The word error rate appears to be a convex function of lm_alpha, for a fixed lm_beta and beam_width. This suggests that a simple binary search will converge just fine.
- For sensible values of lm_beta (from, say, 0.1 to 4), the word error rate is somewhat linearly increasing in lm_beta, but there is a lot of noise, so finding a good value is not so simple.
- However, as the names suggest, lm_alpha has a more significant effect on the word error rate than lm_beta does.
- Unsurprisingly, increasing beam_width also decreases the word error rate, but in an exponentially decaying fashion. From what we saw, the default of 1024 is probably a good balance in the tradeoff between performance and inference time. 2048 did do better, but only in the second or lower decimal place (e.g. 11.32% vs 11.36%).
So the strategy is to:
- Fix the beam_width at 1024
- Fix lm_beta at 1
- Do some kind of search on lm_alpha, assuming the best value lies somewhere between 0.1 and 4 (this will probably require ~5 evaluations)
- Fix lm_alpha at the best value, and do a grid search over lm_beta from 0.1 to 3, with a grid width of about 0.1 (about 30 evaluations; see the sketch after this list)
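Here's a rough sketch of the driver loop, assuming a run_evaluation(lm_alpha, lm_beta) helper that returns the word error rate (it could wrap either of the two approaches above). The lm_alpha step is written as a simple ternary-style interval search, relying on the convexity observation; the exact search method isn't important:

def search_alpha(run_evaluation, lo=0.1, hi=4.0, iters=3, lm_beta=1.0):
    # Shrink the interval around the minimum, relying on WER being convex in lm_alpha.
    # Each iteration costs two evaluations.
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if run_evaluation(m1, lm_beta) < run_evaluation(m2, lm_beta):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2.0

def grid_search_beta(run_evaluation, lm_alpha, lo=0.1, hi=3.0, step=0.1):
    # Brute-force grid over lm_beta at the chosen lm_alpha
    best_beta, best_wer = None, float("inf")
    beta = lo
    while beta <= hi + 1e-9:
        wer = run_evaluation(lm_alpha, beta)
        if wer < best_wer:
            best_beta, best_wer = beta, wer
        beta += step
    return best_beta, best_wer

best_alpha = search_alpha(run_evaluation)
best_beta, best_wer = grid_search_beta(run_evaluation, best_alpha)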
For our (admittedly small) evaluation set, each run of evaluate.py takes about 30 seconds.
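So, ignoring startup overhead, the full plan (roughly 35 evaluations at ~30 seconds each) should come in somewhere under 20 minutes end to end.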