Speaker IDs for Speaker Recognition

Hi Admin,

Thanks for releasing the dataset.
I noticed that there are no speaker IDs provided, only age, gender, and accent, so this dataset can’t be used for speaker recognition projects. If you could also provide speaker IDs, which I believe could be derived easily from usernames, it would increase the usability of this dataset.

Yup, we decided not to add speaker identifiers at this point (for privacy reasons). But you can check out the Tatoeba dataset from our download page, which does group utterances by speaker.

Would it not be possible to anonymize the tags? That way there wouldn’t be any privacy issues.

They would be anonymized, but there would still be privacy issues. We may tackle this in the future, but right now we aren’t looking into it. Tatoeba is your best bet for now.

I also wanted to add that the lack of speaker identities makes the dataset unsuitable for language identification models (once other languages are added, of course). This is because you want to make sure that no speakers in your training set appear in your test or validation sets; otherwise you will almost certainly overfit on speaker identities.
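
To make that concrete, here’s a minimal sketch of a speaker-disjoint split, assuming hypothetical per-clip speaker labels (which this dataset does not yet provide); the file names and speakers below are made up:

```python
# Minimal sketch: split clips so no speaker appears in more than one subset.
# Assumes hypothetical speaker labels; the dataset does not currently ship them.
from sklearn.model_selection import GroupShuffleSplit

clips    = ["clip_0.mp3", "clip_1.mp3", "clip_2.mp3", "clip_3.mp3"]
speakers = ["spk_a",      "spk_a",      "spk_b",      "spk_c"]

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(clips, groups=speakers))

# Every clip in train_idx belongs to a speaker absent from test_idx,
# so a model can't score well just by memorizing voices.
print([clips[i] for i in train_idx], [clips[i] for i in test_idx])
```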

Maybe speakers could opt in to having their samples marked with an anonymized ID.

I have a question related to the last post in this topic:

The corpus, as distributed, is split into train/dev/test sets. I understand you’re not sharing speaker IDs at this point. Are the splits done in such a way that a given speaker is not part of both the training and test sets? This would be very useful information for interpreting the generalization performance of an algorithm.

Is a given example sentence spoken only once by each speaker?

Thank you for your answers!

I came here looking for the answer to this question as well. Have you made any more progress on that?

Listening to some of the files, I think I can recognize some speakers in both the development and validation sets (hint: group by age and look at age groups with few examples).
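
In case it helps anyone reproduce that hint, here is a rough sketch, assuming the release ships per-split metadata files like dev.tsv and test.tsv with an age column (those file names are my assumption):

```python
# Rough sketch of the "group by age" trick: rare age brackets that show up
# in both splits are good candidates for listening. File and column names
# are assumptions, not confirmed in this thread.
import pandas as pd

dev  = pd.read_csv("dev.tsv", sep="\t")
test = pd.read_csv("test.tsv", sep="\t")

print(dev["age"].value_counts())   # look for age groups with few examples...
print(test["age"].value_counts())  # ...that appear on both sides
```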

It would be great if someone from Mozilla could confirm that some (or all?) speakers may appear in all sets.

Same here.

Not being able to group samples by speaker is very unfortunate. Are there any updates on this topic? We are interested in the distribution of contributions per speaker, since it is sometimes the case that very few speakers contribute a substantial part of the spoken data.
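
Concretely, with pseudonymous per-clip speaker tags, the statistic we’re after could be computed like this (the metadata file and column names are hypothetical):

```python
# Hypothetical sketch: how skewed are contributions per speaker?
# Assumes one row per clip and a pseudonymous speaker column.
import pandas as pd

meta = pd.read_csv("validated.tsv", sep="\t")
clips_per_speaker = meta["client_id"].value_counts()  # sorted descending

print(clips_per_speaker.describe())          # mean, median, max clips per speaker
top = max(1, len(clips_per_speaker) // 100)  # top 1% of speakers
share = clips_per_speaker.iloc[:top].sum() / clips_per_speaker.sum()
print(f"top 1% of speakers account for {share:.1%} of all clips")
```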

Shouldn’t it be enough to generate a random key, hash the user IDs with that key, store those hashes as an additional field, and then throw the key away?
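
For what it’s worth, that scheme is essentially a keyed hash (HMAC). A minimal sketch, with made-up user IDs:

```python
# Minimal sketch of the proposed scheme: derive pseudonymous speaker tags
# with a keyed hash, then discard the key so the mapping can't be recomputed.
import hashlib
import hmac
import secrets

key = secrets.token_bytes(32)  # ephemeral random key

def pseudonym(user_id: str) -> str:
    return hmac.new(key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()

# The same user always maps to the same tag; different users to different tags.
tags = [pseudonym(uid) for uid in ["user_a", "user_b", "user_a"]]
assert tags[0] == tags[2] and tags[0] != tags[1]

del key  # once the key is gone, tags can still be grouped but not reversed
```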

Thanks for the questions!
We’re including hashed client_ids with this release, and we’re also making sure that users are unique per bucket. You can find the code that does that here:
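
For example, you can verify that property yourself on the released metadata. A rough sketch (file and column names may differ from the actual release):

```python
# Sanity check: no hashed client_id should appear in more than one bucket.
# File and column names are assumptions about the release layout.
import pandas as pd

buckets = {name: set(pd.read_csv(f"{name}.tsv", sep="\t")["client_id"])
           for name in ("train", "dev", "test")}

assert buckets["train"].isdisjoint(buckets["dev"])
assert buckets["train"].isdisjoint(buckets["test"])
assert buckets["dev"].isdisjoint(buckets["test"])
print("no speaker appears in more than one bucket")
```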

This will be great for TTS research.