I’m not sure what you are referring to. Early versions had a limitation that required the whole audio to be known at inference time, and performance was suboptimal on long audio files. This was fixed a long time ago.
As you can see on each release page, yes, we only have English for now. Sourcing material is a long job in itself, so we can’t do everything.
I’m not sure I understand your question …
That depends on your hardware …
That depends on your system. Any desktop CPU should run faster than realtime. We were able to verify faster-than-realtime performance on Android devices with Snapdragon 820 and 835, as well as on Raspbian running on an RPi4.
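To make "faster than realtime" concrete: a common way to express it is the real-time factor, i.e. processing time divided by audio duration, where a value below 1.0 means the audio is transcribed faster than it plays. The sketch below is only an illustration; the function names are hypothetical and the numbers in the usage lines are made up, not measurements from any of the devices mentioned.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Return processing time divided by audio duration (hypothetical helper)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def is_faster_than_realtime(processing_seconds: float, audio_seconds: float) -> bool:
    """True when inference takes less time than the audio lasts."""
    return real_time_factor(processing_seconds, audio_seconds) < 1.0


# Illustrative numbers only: 10 s of audio processed in 5 s.
rtf = real_time_factor(5.0, 10.0)
print(f"real-time factor: {rtf:.2f}")
print("faster than realtime:", is_faster_than_realtime(5.0, 10.0))
```

To check your own system, time one inference run on a file of known length and plug in the two durations.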
You are comparing unrelated things: benchmarks and real-life use. We still have a long way to go, especially regarding the amount of data. More data will make the model more reliable (noise, accents, etc.).