It’s basically impossible for us to gauge the time required without knowing the distribution of snippet lengths in your data set. For example, a single very long sample can force a batch size of 1 and slow training down considerably.
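If you want a quick feel for that distribution, something like the following should work (a rough sketch: it assumes a plain text file with one WAV path per line and the `soundfile` package; adapt it to however your manifests are actually laid out):

```python
# Rough sketch: summarize clip durations for a dataset.
# Assumes "wav_paths.txt" lists one WAV path per line (hypothetical
# layout -- adapt to your own manifest format).
import soundfile as sf

with open("wav_paths.txt") as f:
    paths = [line.strip() for line in f if line.strip()]

# sf.info reads only the header, so this is cheap even for big sets.
durations = sorted(sf.info(p).duration for p in paths)

n = len(durations)
print(f"clips: {n}")
print(f"min / median / max: {durations[0]:.1f}s / "
      f"{durations[n // 2]:.1f}s / {durations[-1]:.1f}s")
print(f"95th percentile: {durations[int(n * 0.95)]:.1f}s")
```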
However, by way of comparison: when we train on the ~1k hours of LibriSpeech using 8 Titan X Pascal GPUs, it takes several days to converge.
As for decoding time on a CPU or GPU, it depends on the specific CPU or GPU; the surest way to know is to try it on your own hardware. By way of comparison, we’ve gotten faster than real time on a 1070 for clips of approximately 5 sec in length.
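For a concrete measurement, timing one decode and dividing by the clip length gives the real-time factor (RTF); anything below 1.0 is faster than real time. A minimal sketch, where `transcribe` is a placeholder for whatever inference call your setup exposes (not part of any particular library):

```python
# Minimal sketch: measure the real-time factor (RTF) of decoding.
# `transcribe` is a stand-in for your model's inference entry point.
import time

import soundfile as sf

def real_time_factor(transcribe, wav_path):
    audio, sample_rate = sf.read(wav_path)
    audio_seconds = len(audio) / sample_rate

    start = time.perf_counter()
    transcribe(audio, sample_rate)  # run one decode
    decode_seconds = time.perf_counter() - start

    return decode_seconds / audio_seconds

# rtf = real_time_factor(model.transcribe, "clip_5s.wav")
# print(f"RTF: {rtf:.2f} (<1.0 means faster than real time)")
```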
As suggested in the README, the architecture is currently geared towards shorter clips of about 5 sec, so for a 30 sec clip your mileage may vary.
However, a streaming interface is currently in the works [1] and should lift this limitation.
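In the meantime, if you need to handle longer audio now, one common stopgap (not something the repo provides out of the box) is to split a long clip into roughly 5 sec chunks, decode each, and join the transcripts; expect some errors at chunk boundaries where words get cut:

```python
# Hedged sketch of the chunking workaround described above. Words
# straddling a chunk boundary may be mangled, so treat this as a
# stopgap until the streaming interface lands, not a replacement.
import soundfile as sf

def transcribe_long(transcribe, wav_path, chunk_seconds=5.0):
    audio, sample_rate = sf.read(wav_path)
    step = int(chunk_seconds * sample_rate)
    pieces = [
        transcribe(audio[start:start + step], sample_rate)
        for start in range(0, len(audio), step)
    ]
    return " ".join(pieces)
```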