Sharing Common Voice Through peer-to-peer

Codigo_Logo_Programacao_e_Inteligencia_Artificial · March 11, 2019, 8:19pm

Hi everyone! I created a torrent for all datasets currently available on Common Voice. You can download it through this link: magnet:?xt=urn:btih:6318a9e4735b4cdc6c88ccbd9f16e9c1c016ed88&dn=Common+Voice+V2+March+2019.rar

lissyx · March 8, 2019, 4:09pm

This is something I was asking about the other day, but your magnet link seems not super-available. @nukeador Could we create a torrent and spread it on some hosts to help disseminate?

Codigo_Logo_Programacao_e_Inteligencia_Artificial · March 8, 2019, 4:32pm

Yep, I’ll do that. I was hoping to get some people seeding from this community.

lissyx · March 8, 2019, 4:41pm

If you can share it with me directly, I can start seeding.

Codigo_Logo_Programacao_e_Inteligencia_Artificial · March 8, 2019, 4:42pm

How do I do that? Like I’m already sharing through p2p.

lissyx · March 8, 2019, 4:46pm

A direct link to some hosting? Then I can download it completely and quickly and start seeding, because right now it’s very slow.

Codigo_Logo_Programacao_e_Inteligencia_Artificial · March 8, 2019, 4:50pm

Well, I don’t have one. I downloaded it from Common Voice and organized each language in its own folder. I would host it on Mega, but I don’t think you can transfer 35 GB on it. I hope my torrent gets some seeders; we just need a few to start. I mean, the whole point of p2p is how scalable it is.

nukeador · March 8, 2019, 4:51pm

Hi everyone,

I would like to explain why the Common Voice team prefers that the dataset be available only from the official site, without new unofficial download locations.

The main reason is that we want to make sure we have a way to contact everyone who has downloaded the dataset in case someone requests that we remove their voice from it. This is important and we want to respect people’s choices.

If you feel the current download process is not working for you, let’s talk about that and find solutions, but I would like to request we don’t create other places for this download so we also avoid people getting confused about the official one and potentially downloading an outdated or manipulated dataset.

Thanks!

Codigo_Logo_Programacao_e_Inteligencia_Artificial · March 8, 2019, 4:57pm

@nukeador Got it, I will stop my torrent. But I have a different opinion: what’s the point of the dataset being CC-0 if you can’t share it? With regard to people wanting their voices removed, how do you know which clips belong to whom?

Codigo_Logo_Programacao_e_Inteligencia_Artificial · March 8, 2019, 5:02pm

I always prefer to download through p2p. I think Common Voice should also have this option, as many open source programs and Linux distributions do.

nukeador · March 8, 2019, 5:02pm

Having CC-0 is especially important so that the dataset can be used by many different commercial and non-commercial entities. In terms of sharing, we prefer to always point to the official site for the reasons I listed; we think people will understand.

The site knows which speaker ID belongs to which user, but this information is not exposed in the dataset for privacy reasons.

nukeador · March 8, 2019, 5:03pm

What’s the problem you are currently experiencing when downloading the dataset? Is it speed? Other? We can look into it.

Codigo_Logo_Programacao_e_Inteligencia_Artificial · March 8, 2019, 5:05pm

Thanks for clarifying. One thing that was odd is the tar files: you have to extract twice. Why not use .zip or .gzip?

nukeador · March 8, 2019, 5:07pm

Do you mean tar.gz file? You should be able to extract it at once with 7-Zip, command line or any other archiver that supports this format.
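For reference, a minimal sketch of single-step extraction from a script, using Python's standard `tarfile` module. The function name and any paths passed to it are hypothetical, not part of the Common Voice tooling.

```python
import tarfile
from pathlib import Path


def extract_tar_gz(archive_path, dest_dir):
    """Decompress and unpack a .tar.gz archive in one pass.

    Returns the sorted relative paths of the files found under dest_dir.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # Mode "r:gz" handles gzip decompression and tar unpacking together,
    # so no intermediate .tar file is written to disk.
    with tarfile.open(archive_path, "r:gz") as tar:
        # Only extract archives from a source you trust.
        tar.extractall(dest)
    return sorted(
        p.relative_to(dest).as_posix() for p in dest.rglob("*") if p.is_file()
    )
```

On the command line, `tar -xzf <archive>.tar.gz` does the same in one step.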

Codigo_Logo_Programacao_e_Inteligencia_Artificial · March 8, 2019, 5:09pm

@nukeador I think the whole dataset should be put into one file, so that if I extract it I have a good workspace like this:

And maybe split them into subfolders; having half a million files in a single folder makes it harder to play some of the files.
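A minimal sketch of the subfolder idea, assuming the clips sit in one flat directory. The function names and the hash-prefix bucketing scheme are illustrative choices, not how Common Voice packages its data.

```python
import hashlib
import shutil
from pathlib import Path


def shard_name(filename, width=2):
    """Return a short hex bucket name derived from the filename."""
    return hashlib.md5(filename.encode("utf-8")).hexdigest()[:width]


def shard_directory(flat_dir, width=2):
    """Move each file in flat_dir into a subfolder named after its bucket.

    Returns the number of files moved.
    """
    flat = Path(flat_dir)
    moved = 0
    # Snapshot the listing first so moving files does not disturb iteration.
    for path in [p for p in flat.iterdir() if p.is_file()]:
        bucket = flat / shard_name(path.name, width)
        bucket.mkdir(exist_ok=True)
        shutil.move(str(path), str(bucket / path.name))
        moved += 1
    return moved
```

With `width=2` there are at most 256 buckets, so half a million files would average roughly 2,000 per folder. Any metadata that refers to clips by filename would need the same `shard_name` mapping to locate them.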

nukeador · March 8, 2019, 5:09pm

Understood. Pinging @gregor and @kdavis here for their feedback on how to improve this.

kdavis · March 11, 2019, 9:18am

I think having both options available, all in one zip and each language in a separate zip, seems reasonable. Some people want to work with all languages, and some with only a single language.