Fine-tuning DeepSpeech Model (CommonVoice-DATA)

hey @alchemi5t,

Thanks for your answer, it helps me understand batches better.
Well, it was me who was wrong; I didn't see it that way. I was going by the old, incorrect answer in your quote :stuck_out_tongue: What your quote says is logical: a bigger batch size helps avoid falling into a local minimum.

I’ll try going with batch size = 64; I don’t know if my machine will handle it though.

I don’t get this part. What does that mean? How do you do that?

EDIT: That’s what I feared: my machine can’t handle batch size 64… All the more reason to understand what the quote above means.

Here’s what I get using LR 0.0001, batch size 32, 1 epoch, drop layer 0, fine-tuning dropout rate 0.25, training on CV, testing on CV:

WER = 52.96%, CER = 34.57%, test_loss = 50.90
training_loss = 44.15
validation_loss = 44.32

I’m sure I’m missing something here, but I really don’t see what…

You have to modify the code and change the training pipeline for that. Instead of applying the gradients immediately, you store them, compute gradients for however many more batches you need, average them, and only then propagate the update. For example, if your maximum possible batch size is 32, compute the gradients for the first batch, then, instead of updating the weights right away, compute the second batch’s gradients as well, average the two, and update the weights once; that effectively gives you a batch size of 64.
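Here’s a minimal sketch of that idea in TensorFlow 1.x style (which DeepSpeech uses). The graph below is a toy stand-in, not DeepSpeech’s actual pipeline; only the accumulation pattern is the point:

```python
import numpy as np
import tensorflow as tf

# Toy stand-in graph; DeepSpeech's real graph is much larger,
# but the accumulation pattern is the same.
x = tf.placeholder(tf.float32, [None, 10])
y = tf.placeholder(tf.float32, [None, 1])
pred = tf.layers.dense(x, 1)
loss = tf.reduce_mean(tf.square(pred - y))

optimizer = tf.train.AdamOptimizer(1e-4)
tvars = tf.trainable_variables()

# One non-trainable accumulator per weight tensor.
accum = [tf.Variable(tf.zeros_like(v), trainable=False) for v in tvars]
zero_accum = [a.assign(tf.zeros_like(a)) for a in accum]
grads = tf.gradients(loss, tvars)
accum_grads = [a.assign_add(g) for a, g in zip(accum, grads)]

N_ACCUM = 2  # two batches of 32 -> effective batch size 64
apply_grads = optimizer.apply_gradients(
    [(a / N_ACCUM, v) for a, v in zip(accum, tvars)])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(100):
        sess.run(zero_accum)
        for _ in range(N_ACCUM):  # gradients from N_ACCUM small batches
            bx = np.random.rand(32, 10).astype(np.float32)
            by = np.random.rand(32, 1).astype(np.float32)
            sess.run(accum_grads, feed_dict={x: bx, y: by})
        sess.run(apply_grads)  # one weight update with averaged gradients
```

The key design choice is that `accum_grads` and `apply_grads` are separate ops, so you control how many small batches feed into each single weight update.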

Edit: I didn’t see that you had already understood what that meant.

If I understand it correctly (and there’s a good chance I’m missing something or explaining it badly), you have to drop the last layer (in order to make the model sensitive to your language while still reusing the work done on the old language). You then train the last layer on your language’s dataset and adapt the LM.
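For what it’s worth, here’s a rough sketch of how that layer-dropping could be done with TF1-style checkpoints. The layer names and checkpoint path are hypothetical stand-ins, not DeepSpeech’s actual code:

```python
import tensorflow as tf

# Toy stand-in for an acoustic model: layers we keep, plus an output
# layer ("layer_6" here, a hypothetical name) sized for the NEW alphabet.
x = tf.placeholder(tf.float32, [None, 26])
h = x
for i in range(1, 6):
    with tf.variable_scope("layer_%d" % i):
        h = tf.layers.dense(h, 128, activation=tf.nn.relu)
with tf.variable_scope("layer_6"):
    logits = tf.layers.dense(h, 40)  # new language's alphabet size

# Restore everything from the source-language checkpoint EXCEPT the
# output layer, which stays randomly initialised and gets retrained.
keep = [v for v in tf.global_variables()
        if not v.name.startswith("layer_6")]
saver = tf.train.Saver(var_list=keep)

CKPT = "path/to/source_checkpoint/best_dev"  # hypothetical path
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # random init everywhere
    saver.restore(sess, CKPT)  # overwrites all but layer_6
    # ... then train on the new-language dataset, and rebuild the LM ...
```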

You can also merge two acoustic models (I know the theory, but in practice I don’t know how to do it), but that’s more for a model oriented towards non-native accents.

Well, that’s not what I expected, haha. I’m reluctant to modify the code; I don’t want to introduce bugs.
Even if I do this, I think there’s more to investigate, because I’m at 52% WER and I’ve seen people get down to 22% with CV.
I still don’t get why my loss never goes below 30 while yours is below 5…

I have a few reservations. I’ll make a new thread to discuss this. Thank you for your inputs!

Ask @lissyx; he works on TL for French, if I remember correctly, so he may explain it better than I can.

The only transfer learning I’m doing for French is not really the same transfer learning as discussed above.

BTW, you mention not being able to get below a loss of 30. Is this with CV FR? I’m pretty sure I already shared links with you to the GitHub issues: the data inside Common Voice needs some love, and I haven’t had the time to do that. And so far, nobody has cared.

Yeah, I remember the various issues with the French CV dataset; that’s why I’m using the English CV for now. See it as a POC, to get used to TL, DeepSpeech, and ASR.
I recorded a few sentences in CV, but like you, I don’t have a lot of time for it…

The Common Voice dataset contains clips with errors. I’m working on building up a list of the offending clips so they can be put in the invalid clips CSV, but in the meantime, if you see a transcript that is wildly different from what it’s supposed to say, you can look up the transcript in the test CSV to get the original filename, then play it back and check whether it’s correct. If not, just remove it from the CSV.
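A small sketch of that lookup-and-removal step with pandas, assuming the CSV layout DeepSpeech’s importers produce (`wav_filename`, `wav_filesize`, `transcript`); the paths and filenames here are illustrative:

```python
import pandas as pd

TEST_CSV = "clips/test.csv"  # hypothetical path

df = pd.read_csv(TEST_CSV)

# Find the clip behind a suspicious transcript so you can listen to it.
suspect = df[df["transcript"].str.contains("suspicious phrase", na=False)]
print(suspect[["wav_filename", "transcript"]])

# After listening: drop the offending rows and write the CSV back.
bad_files = ["clips/common_voice_en_123456.wav"]  # hypothetical names
df = df[~df["wav_filename"].isin(bad_files)]
df.to_csv(TEST_CSV, index=False)
```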

I was able to improve the WER by a few points just by removing some of the worst offenders from the test set.

Good idea, I’ll do that! Every tip is worth taking :slight_smile:

Not sure if it’ll help with my loss issue though…

Thanks a lot @dabinat!

Hello there, where did you get the pre-trained model on Common Voice data?
Can you please share the link?

Hi @Rosni07, welcome to the forum :slight_smile:

You can find pre-trained models on the releases page: https://github.com/mozilla/DeepSpeech/releases

Scroll down until you find one that isn’t -alpha; it should contain links to the pre-trained models and the other downloads/details you’ll need.

Btw, this part of the forum (#deep-speech) is associated with that repo, so it’s generally worth having a look through the repo before posting about simple stuff like this.

I already checked the releases page, but the recently released model was trained on American English, and I was wondering if there is one trained on the Common Voice dataset?

The pre-trained model doesn’t give accurate results, maybe because of the differences in accent and pronunciation between the training data (American English) and my voice (female Asian-Indian accent). So I was wondering if there is any pre-trained model on Common Voice data.

I see - I hadn’t picked that up from your initial question.

0.4.1 did include a snapshot from English Common Voice. There was some discussion on this here (Any reason 0.5.x models weren't trained on Common Voice data this time?): why 0.5 didn’t include Common Voice (it was an oversight), and it’s likely to be in the released models for 0.6 once it’s out of alpha.

If you’re trying to improve it specifically for Indian English, some fine-tuning might help (I know others on here have been looking at that for Indian accents, but I don’t know how they’ve got on). Another approach to consider would be including Indian-sourced text in the LM, since that could help it cope with “Indianisms” that aren’t typically part of the American / British English data that likely makes up the bulk of the LM data.
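In case it’s useful, here’s a rough sketch of how mixing in Indian-sourced text could look with KenLM (the toolkit DeepSpeech’s LM is built with), assuming `lmplz` and `build_binary` are installed and on your PATH; the filenames are illustrative:

```python
import subprocess

# Hypothetical corpora: the usual LM text plus Indian-English sources.
corpora = ["librispeech-lm-norm.txt", "indian_english_text.txt"]

# Concatenate the corpora into a single training file.
with open("lm_text.txt", "w") as out:
    for path in corpora:
        with open(path) as f:
            out.write(f.read())

# Train a 5-gram KenLM model, then convert it to the binary format.
subprocess.run(["lmplz", "--order", "5", "--text", "lm_text.txt",
                "--arpa", "lm.arpa"], check=True)
subprocess.run(["build_binary", "lm.arpa", "lm.binary"], check=True)
```

You’d then regenerate the trie with the `generate_trie` tool from DeepSpeech’s native_client so the decoder can use the new LM.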


Okay, thank you for letting me know that, and thank you for welcoming me to this forum… didn’t notice that earlier. :sweat_smile::slightly_smiling_face:


Sadly, that’s very likely your issue. Even with the Common Voice data, the amount and diversity available right now are not yet enough to provide a noticeable improvement for non-American English.

How bad is the outcome?

I haven’t tested extensively. I did some minor testing with phrases like “hello”, “this is a testing file”, etc.; the results weren’t accurate for those short utterances. I also had one more problem/query: the model generated text for the noise that came along with the audio.
For example, a clip with a one-second pause followed by “hello” gave “and hello”.

By “my voice (female Asian-Indian accent)”, I meant the general users/people I will likely use to test the project I am working on (since I am from an Asian country) :slight_smile:

Well, noise robustness is also something we still have pending work on, so it’s not surprising.

Helping the Common Voice project reach a wider range of contributions would likely help: for example, there are still ~300 h recorded but not yet validated in the English dataset.

You can also have a look at https://activate.mozilla.community/commonvoice for contribution tips on Common Voice. But getting more Indian speakers contributing, and making sure they document that in their profile, would likely help.
