Actually, it started improving after ~2k batches, but the output is still very bad even after 5k:
2018-07-31 12:19:31 train 5555 | loss 66.7 | CER 0.655 | WER 1.061
“chapter seven parry” vs “go ssn ca”
2018-07-31 12:19:37 train 5560 | loss 76.1 | CER 0.676 | WER 1.013
“oh yes said the dying girl” vs “o y sta the tii go”
2018-07-31 12:19:44 train 5565 | loss 72.2 | CER 0.653 | WER 1.032
“brutus why” vs “oess ”
2018-07-31 12:19:51 train 5570 | loss 76.9 | CER 0.619 | WER 0.988
“yet in the broad light of the forenoon” vs “s in the do lla o the fforr me”
2018-07-31 12:19:58 train 5575 | loss 73.5 | CER 0.623 | WER 1.035
“sylvie sylvie” vs “sse ssoley”
2018-07-31 12:20:05 train 5580 | loss 78.6 | CER 0.649 | WER 1.058
“he exclaimed now is the time” vs “h ssin nnss an tii”
2018-07-31 12:20:14 train 5585 | loss 78.9 | CER 0.635 | WER 1.042
“a savage finds in a wreck on the coast” vs “a sstee innes and a rreann ecccosss”
For comparison, the other implementations I tried (fordDSP, yao-matrix, zzw992cn) produce better results than this after only 300 batches. For example, here’s what the yao-matrix code produces after 3k batches:
THEY WERE RUN OUT OF THEIR VILLAGE vs TEY WERE ON OUT OF THEIR VILLAGE
THE WHOLE THING WAS A TRIFLE ODD vs TE OTHING WAS ATRIFLOND
I HAD NO ILLUSIONS vs I HAD NO OLUSIONS
HE CHECKED THE SILLY IMPULSE vs HE CHETHE SILY IM PULSE
SO HE’S A FRIEND OF YOURS EH vs SO HE’S A FREND OF YOURS ANY
A MAN IN THE WELL vs A MAN IN TE WELL
I COULD NOT HELP MY FRIEND vs I COULD NOT HELD MY FRIEND
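
To compare the two sets of samples on the same scale, CER and WER can be computed from edit distance. Below is a minimal sketch using the standard Levenshtein-based definitions. I'm assuming that is roughly what the training script logs, but I haven't checked how it actually computes or averages them. The helper names `edit_distance`, `cer` and `wer` are my own and don't come from any of the implementations mentioned above.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists of words)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix to each hyp prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp is i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from ref
                            curr[j - 1] + 1,      # insert h into ref
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: character edits divided by reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


def wer(ref, hyp):
    """Word error rate: word edits divided by number of reference words."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / max(len(ref_words), 1)


if __name__ == "__main__":
    ref, hyp = "I HAD NO ILLUSIONS", "I HAD NO OLUSIONS"
    print("CER %.3f | WER %.3f" % (cer(ref, hyp), wer(ref, hyp)))
```

Under this definition WER is not capped at 1.0: a hypothesis with more wrong words than the reference has words scores above 1, so the values slightly above 1 in my log are at least possible, if that is how they are computed.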