Running TTS on constrained hardware (+ no GPU)

Has anyone looked at the practicalities of running this TTS inference on constrained hardware, such as a mobile phone / Raspberry Pi?

I haven’t got to the point of trying this myself yet, but it would be useful to hear whether anyone has tried it and/or whether it’s on the roadmap for the project.

I’m assuming inference time would be measurably longer, if it’s possible at all; of course, not having a GPU might be a deal breaker altogether (??)

If it weren’t exceptionally slow, it might still be reasonably usable in a number of scenarios, since it’s fairly easy to make the demo server cache results (helpful where the bulk of your responses come from a common set of spoken output, which wouldn’t need inference after the first time).
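For illustration, here’s a minimal sketch of that kind of caching, assuming a Flask-style demo server; the `/api/tts` route and the `synthesize()` call are hypothetical stand-ins for the project’s actual endpoint and inference function:

```python
from functools import lru_cache
import io

from flask import Flask, request, send_file

app = Flask(__name__)

def synthesize(text: str) -> bytes:
    """Hypothetical placeholder for the real text-to-wav inference call."""
    raise NotImplementedError("wire this up to the model")

@lru_cache(maxsize=1024)  # repeated phrases skip inference entirely
def synthesize_cached(text: str) -> bytes:
    return synthesize(text)

@app.route("/api/tts")
def tts():
    wav_bytes = synthesize_cached(request.args.get("text", ""))
    return send_file(io.BytesIO(wav_bytes), mimetype="audio/wav")
```

An on-disk cache keyed on a hash of the text would additionally survive restarts, which matters more on a device that reboots.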


I don’t think you can go as low as a Raspberry Pi, but if you don’t use a neural vocoder with the Tacotron architecture, TTS can reach real time on a CPU.
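To make that concrete, here’s a rough sketch of skipping the neural vocoder and inverting the model’s mel output with Griffin-Lim via librosa; the STFT/mel parameters below are assumptions that would have to match the acoustic model’s actual config:

```python
import numpy as np
import librosa

def mel_to_wav(mel: np.ndarray, sr: int = 22050) -> np.ndarray:
    """Invert a (n_mels, frames) power mel spectrogram to a waveform.

    Griffin-Lim estimates phase iteratively, so it's much cheaper than
    a neural vocoder, at some cost in audio quality.
    """
    return librosa.feature.inverse.mel_to_audio(
        mel,
        sr=sr,
        n_fft=1024,      # assumed; must match the training config
        hop_length=256,  # assumed; must match the training config
        n_iter=60,       # more iterations -> better phase estimate
    )
```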

Nevertheless, our ultimate goal is to optimize all the code so it can run on low-resource systems. Any contributions toward that are always welcome.


Inference from text to mel spectrograms runs fine on an older 4-core Intel CPU; it’s about 10x faster than real time. There are many tricks that can be used to speed it up and reduce memory use further, such as pruning weights and quantizing to int16 or even int8. This should make it fast enough to run on a high-end phone.
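As a rough illustration of the quantization idea, PyTorch’s dynamic quantization swaps float32 weights for int8 at load time; the toy model below is just a stand-in for the real text-to-mel network:

```python
import torch
import torch.nn as nn

# Toy stand-in for the loaded text-to-mel network (layer sizes made up).
model = nn.Sequential(nn.Linear(256, 512), nn.ReLU(), nn.Linear(512, 80)).eval()

# Weights are stored as int8; activations are quantized on the fly per
# batch, so no calibration dataset is needed.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

with torch.no_grad():
    out = quantized(torch.randn(1, 256))  # int8 matmuls on CPU
```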

WaveRNN-type vocoders are fast enough for real-time synthesis on a laptop CPU and may be fast enough on a mobile device with some additional optimization.
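A simple way to check “fast enough” on a given CPU is the real-time factor (synthesis time divided by audio duration); a sketch, where `vocoder_infer` is any callable returning a waveform array:

```python
import time
import numpy as np

SAMPLE_RATE = 22050  # assumption; use the model's actual output rate

def real_time_factor(vocoder_infer, mel: np.ndarray) -> float:
    """RTF < 1.0 means the vocoder synthesizes faster than playback."""
    start = time.perf_counter()
    wav = vocoder_infer(mel)
    return (time.perf_counter() - start) / (len(wav) / SAMPLE_RATE)
```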

However, even if it runs fast enough, it may drain the battery too quickly for some applications (e.g. audiobooks).