Using TTS to automate voice collection, like a classic language course

After doing some voice donation I got the idea that it might be useful to have a standard TTS engine read out the sentence, then start the recording (perhaps marked with a beep) and let the contributor repeat the sentence.
This way the contributor could do it while also doing something else, which could increase the time a contributor can spend on donating their voice, and maybe even allow more people to contribute who don't have the time to spend focusing on a screen.

Not sure if this would be possible from a plain webapp though…
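For what it's worth, here is a rough sketch of how that flow could look in the browser, assuming the Web Speech API (speechSynthesis) for the prompt and MediaRecorder for the capture; the function name and the fixed recording window are made up for illustration, not anything from the Common Voice codebase:

```typescript
// Sketch only: prompt with the browser's built-in TTS, beep, then record
// the contributor repeating the sentence. Assumes a browser that supports
// speechSynthesis, AudioContext and MediaRecorder.
async function promptAndRecord(sentence: string): Promise<Blob> {
  // 1. Have the browser's TTS read the sentence aloud.
  await new Promise<void>((resolve) => {
    const utterance = new SpeechSynthesisUtterance(sentence);
    utterance.onend = () => resolve();
    speechSynthesis.speak(utterance);
  });

  // 2. Short beep to mark the start of recording.
  const ctx = new AudioContext();
  const beep = ctx.createOscillator();
  beep.connect(ctx.destination);
  beep.start();
  beep.stop(ctx.currentTime + 0.2);

  // 3. Record the contributor's repetition with MediaRecorder.
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream);
  const chunks: Blob[] = [];
  recorder.ondataavailable = (e) => chunks.push(e.data);

  return new Promise<Blob>((resolve) => {
    recorder.onstop = () => {
      stream.getTracks().forEach((t) => t.stop());
      resolve(new Blob(chunks, { type: recorder.mimeType }));
    };
    recorder.start();
    // Fixed 10 s window just for the sketch; a real UI would use a stop
    // button or silence detection instead.
    setTimeout(() => recorder.stop(), 10_000);
  });
}
```

Note that the TTS output never enters the recorded blob here; only the microphone stream is kept, so the dataset would still contain the contributor's own voice.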


There's a 99.99% chance it's against the terms of use of the TTS service, and it's also going to break the usefulness of the dataset.

Hi Linus,
I find the idea excellent!
It would encourage blind people to contribute, as well as people who cannot read or write. My mother is a native speaker of Kabyle and she would love to contribute to Common Voice, but she is illiterate and would need assistance. I hope the idea makes its way.

I guess it depends on how you set it up. If Common Voice just pushes a text string to some TTS engine and the recording is kept separate, I guess the licence of the TTS engine won't affect the final dataset, but it might be tricky to make it that modular. If that's not possible, there are a couple of open-source TTS engines, but most seem old and mostly come in English and maybe a few other languages, so maybe not too useful.

Another way could be to use a TTS trained on the Common Voice data itself, but that would only help once quite a lot of data has already been collected, and I don't know how much is needed for a usable TTS model. There are even some decent examples of the English one that are only a few months old, and the English dataset still has a few hundred hours left to reach the 1,200-hour goal.

Those are reasons I didn't even think of, but really interesting! I guess I'm of the podcast generation and like to do one thing with my ears and another with my hands :slight_smile: Do you know of any existing TTS engines that support Kabyle, open or proprietary?

The point of Common Voice is to collect human voice. Using TTS completely defeats that.

I'm not sure you've understood my point in that case.
My idea is to replace just the "view text on screen - read" part of the process, not the "speak - record" part.

I.e. you listen to a sentence read out by some TTS, then the recording starts and you repeat the sentence with your own voice and pronunciation.

OK, but you're just making the process longer and more complicated; I really don't see what value or improvement this brings.

As I stated, there are many situations where it is not possible to contribute if you need to read from a screen: e.g., as sifaks mentioned, if you are disabled in some way, or, for me, it would be useful to be able to contribute while walking or doing something else. How do you suggest that could be done in a less complicated way?

@belkacem77 could likely guide you here. The fact that he's so active on Kabyle and interested in TTS makes me believe there is currently no solution.

Blind people would already have a screen reader.

If you're walking, you should look ahead so you don't bump into things :-).

I don't see any simple way to integrate that. Some of your points might be valid, but I'm not sure the Common Voice team has the bandwidth to address those.

Maybe, if you can, sending a PR is the best way to try and get traction?

Maybe that's the solution; probably not optimized for this, but maybe good enough.
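As an illustration only, assuming the sentence prompt were exposed through an ARIA live region, the contributor's own screen reader would announce each new sentence without any extra TTS; the element id here is invented, not actual Common Voice markup:

```typescript
// Sketch only: let the user's screen reader announce new sentences by
// updating an ARIA live region. It uses whatever voice/language the
// user's own assistive tech is configured for.
function announceSentence(sentence: string): void {
  let region = document.getElementById("sentence-announcer");
  if (!region) {
    region = document.createElement("div");
    region.id = "sentence-announcer";
    // "polite" waits for the screen reader to finish its current output;
    // "assertive" would interrupt it immediately.
    region.setAttribute("aria-live", "polite");
    document.body.appendChild(region);
  }
  // Changing the text content is what triggers the announcement.
  region.textContent = sentence;
}
```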

Yes, exactly, that's why I would prefer to take in the info with my ears, not my eyes, so I can look ahead.

I get that there is not a great chance that such a project will get prioritized by the team, and I'm unfortunately still quite bad at coding, but I was more after some feedback and maybe a hint of where to look further. I'll probably investigate what I can do myself anyhow and hopefully come up with a PR some day :slight_smile: or suggest some modular setup if others are interested.

@lissyx

There is no TTS or STT engine supporting Kabyle yet. We are gathering voice data to build one :slight_smile:

Yep, @belkacem77 is the right guy to ask here.

An interesting idea, but it would also destroy diversity where there are multiple variants of the same language. To take one example, assuming the TTS system speaks US English, then British, Welsh, Scottish, Australian, Indian and many other speakers will be 'encouraged' (in fact almost forced) to copy 'incorrect' US pronunciation and stress patterns. Also, even the best TTS systems make frequent errors in pronunciation and stress, especially with unusual words and proper names. Listeners will simply replicate what they hear and lock those errors into the recordings database.


That's actually a valid point: if the TTS is bad, there's a risk that it spills over into the actual human recordings. That's what you meant, right? Hard to tell to what degree; I hadn't thought of this.

I like this idea because users would no longer be reading text but reciting it from memory, which may lead to a more natural, conversational style of speech.

But as @Michael_Maggs pointed out, there would be a lot of challenges in implementing it reliably.


This just advocates for more guidance when people record: with the current setup you can (and should) read the sentence once first and then recite it from memory. Also, I think it's good that we don't have only this type of speech; people not talking fluently, etc., is just as valid IMHO.

What about not using TTS, but having the possibility to re-record the sentence we are listening to? It would still be useful for blind people and people who cannot read. All sentences have to be recorded several times anyway!

Well, in fact, ideally we only want/need one recording per sentence. Having a lot of recordings of the same sentence is not very useful for the Deep Speech algorithms.
