Cached Data while generating batches

Hi, I observed that a lot of data is being cached while generating the batches. I am not clear on what data is being cached and why it needs to be cached. Can you help me understand?

Have you had a look at DeepSpeech.py and util/feeding.py where caching happens?

Yes, I have looked at it, and as per the TensorFlow documentation, the cache() function caches the elements of the Dataset. But I am observing that the cache file grows after every epoch, so I thought maybe something else is also being cached. If it were caching only the elements of the dataset, it should not need to cache them again after the first epoch.
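To make my understanding concrete, here is a minimal sketch of how the documented behavior should work. This is not DeepSpeech code; the preprocessing function is just a stand-in:

```python
import tensorflow as tf

def expensive_preprocess(x):
    # Stand-in for real feature extraction (e.g. computing MFCCs from audio).
    return tf.cast(x, tf.float32) * 2.0

dataset = tf.data.Dataset.range(5).map(expensive_preprocess)

# cache() with no argument keeps elements in memory;
# cache("/some/path") would write them to files on disk instead.
cached = dataset.cache()

for epoch in range(3):
    # Epoch 0 runs the map function and fills the cache;
    # later epochs should read elements back from the cache instead.
    for element in cached:
        pass
```

With this behavior, the cache should be fully populated after the first epoch and stay the same size afterwards, which is not what I observe.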

But then you would erase epoch (n-1) with epoch (n) data, and when you start a new training run, your cache would not contain what you expected. Or am I misunderstanding your assertion?

Then what is the purpose of caching if we erase the previous epoch’s data? Once a dataset sample has been sent to the model, it won’t be required again until the next epoch, and if we erase the cache after every epoch, doesn’t that make the caching redundant?

What I am trying to ask is: why is the model caching the elements at every epoch? What is the purpose of this?

It is not supposed to be caching on every epoch; what you’re seeing is probably a bug. But as you can see from our code, all we do is call tf.data.Dataset.cache, so if it is a bug, it’s on TensorFlow’s end. Are you caching to disk or to memory? Do you specify the --feature_cache flag when training?
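Roughly, the relevant logic amounts to something like the sketch below. This is simplified, not the exact util/feeding.py code, and the function and parameter names are only illustrative:

```python
import tensorflow as tf

def apply_feature_cache(features: tf.data.Dataset, cache_path: str = "") -> tf.data.Dataset:
    if cache_path:
        # A filename (e.g. the path passed via --feature_cache) makes tf.data
        # cache the computed features to files on disk.
        return features.cache(cache_path)
    # No path: features are cached in memory, which can exhaust RAM on
    # large datasets.
    return features.cache()
```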

Earlier we were trying to cache to memory, which caused a memory overflow when we used sorted wav files. I raised this issue on GitHub, where you suggested using the --feature_cache flag.
After using --feature_cache to cache to disk, the memory issue was resolved, but then we saw that the cache file size increases after every epoch, so we were curious about what was being cached and what the purpose of the caching was.
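For reference, this is roughly how the growth can be observed, as a rough sketch with a toy dataset and a placeholder cache path rather than our actual pipeline:

```python
import glob
import os
import tensorflow as tf

CACHE_PATH = "/tmp/feature_cache"  # placeholder path, not our real setup

# Toy dataset standing in for the feature pipeline.
dataset = tf.data.Dataset.range(1000).map(lambda x: tf.cast(x, tf.float32))
cached = dataset.cache(CACHE_PATH)

for epoch in range(3):
    for _ in cached:
        pass
    # tf.data writes the disk cache as files prefixed with CACHE_PATH.
    size = sum(os.path.getsize(f) for f in glob.glob(CACHE_PATH + "*"))
    print("epoch %d: cache size on disk = %d bytes" % (epoch, size))
```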

That was exactly my point. Looks like I should avoid providing support before getting coffee. I mixed up epoch and batch in my mind …