So you suggest adding the vocabulary of every language to the French corpus, so that every word of every language can be understood?
As of today, speech recognition engines are trained on each language independently. In my opinion, the problem does not come from the fact that French sentences lack foreign expressions, but from the fact that the AIs we have developed so far don't have a global understanding of speech.
I think the goal of Common Voice is to have independent datasets for each language, and if an AI developer wants a model able to understand English expressions inside French sentences, then it is up to them to develop a solution that combines an English model with a French one (see the sketch below).
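To illustrate what such a combination could look like, here is a minimal sketch in Python. Everything in it is hypothetical: the `StubModel` class, its `transcribe()` method, and the toy confidence scoring stand in for whatever real monolingual recognizers a developer would actually use; this is not an existing ASR API.

```python
# Hedged sketch: combining two monolingual recognizers by keeping,
# for each audio segment, the hypothesis of the most confident model.
# StubModel and its scoring are illustrative placeholders only.
from dataclasses import dataclass


@dataclass
class Hypothesis:
    text: str
    confidence: float
    lang: str


class StubModel:
    """Placeholder for a monolingual speech recognition model."""

    def __init__(self, lang, known_words):
        self.lang = lang
        self.known = set(known_words)

    def transcribe(self, segment):
        # Toy scoring: fraction of the segment's words this model knows.
        words = segment.split()
        score = sum(w in self.known for w in words) / max(len(words), 1)
        return Hypothesis(segment, score, self.lang)


def combine(models, segments):
    """For each segment, keep the hypothesis of the most confident model."""
    return [
        max((m.transcribe(seg) for m in models), key=lambda h: h.confidence)
        for seg in segments
    ]


french = StubModel("fr", {"je", "vais", "au", "bureau"})
english = StubModel("en", {"asap", "deadline", "meeting"})

for hyp in combine([french, english], ["je vais au bureau", "asap"]):
    print(hyp.lang, "->", hyp.text)  # fr -> je vais au bureau / en -> asap
```

The point of the sketch is only that the mixing happens on the model side: each dataset stays monolingual, and the combination logic lives in the developer's system.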
The way we collect data today should not be biased by the limitations of today's state of the art.
The problem I have with adding foreign expressions to French sentences is that Common Voice tries to gather datasets specific to each language; adding foreign expressions would cause the foreign words to enter the French vocabulary as if they were French, which they are not.
So at the end of the day, we try to have independent datasets, but at the same time we are creating a global vocabulary spanning every language instead of a specifically French one.
Foreign expressions should remain marked as foreign expressions.
The ideal we should aim for, in my opinion, is to have independent vocabularies and then merge them at training time if you need to understand mixed sentences.
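To make "merge them at training time" concrete, here is a minimal, self-contained Python sketch. The data structures and function names are my own illustration, not part of Common Voice or any real toolkit: each vocabulary entry keeps a language tag, so the merged lexicon still knows which words are French and which are foreign.

```python
# Minimal sketch of "independent vocabularies merged at training time".
# All names and data here are illustrative, not a real ASR toolkit API.

def tag_vocabulary(words, lang):
    """Tag every word with its language of origin."""
    return {word: {lang} for word in words}


def merge_vocabularies(*vocabs):
    """Merge language-tagged vocabularies into one training lexicon.

    A word present in several languages keeps every tag, so the merged
    lexicon never pretends a foreign word belongs to French.
    """
    merged = {}
    for vocab in vocabs:
        for word, langs in vocab.items():
            merged.setdefault(word, set()).update(langs)
    return merged


french = tag_vocabulary(["bonjour", "semaine", "cool"], "fr")
english = tag_vocabulary(["hello", "deadline", "cool"], "en")
lexicon = merge_vocabularies(french, english)

print(lexicon["deadline"])  # {'en'}        -> clearly marked as foreign
print(lexicon["cool"])      # {'en', 'fr'}  -> shared between both languages
```

The design choice this illustrates is that the merge is reversible: you can always recover the pure French vocabulary from the merged lexicon, which you cannot do once foreign words have been recorded as plain French entries in the dataset itself.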
What I am trying to say is that, yes, in real life we often mix languages within one sentence; there are no clear boundaries between languages. But the way we model speech today does have boundaries, and that's why we have so much trouble with foreign expressions.
The problem does not come from the data, but from the way we model speech.
Still, I understand we may have different opinions on this issue.