Feature request: Adding emotion to sentences

Hi all,

There is no large public database for speech emotion recognition (i.e., recognizing emotion from the audio alone, without text information).

Since we are already collecting voices, I think it would be very useful to also build a speech emotion database that is language-independent and gender-independent.

When a person contributes a sentence, he/she will also be asked to specify the emotion associated with the sentence.
(neutral, happy, sad, angry, scared, bored, surprised, disgusted)

When validating the sentence, the user will also try to identify the correct emotion.
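
Concretely, each clip could carry two emotion fields, one chosen by the contributor and one by the validator. A minimal sketch in Python (the class and field names are hypothetical, not an actual Common Voice schema):

```python
# Hypothetical per-clip record for the emotion feature; not the real Common Voice schema.
from dataclasses import dataclass
from enum import Enum
from typing import Optional

class Emotion(Enum):
    NEUTRAL = "neutral"
    HAPPY = "happy"
    SAD = "sad"
    ANGRY = "angry"
    SCARED = "scared"
    BORED = "bored"
    SURPRISED = "surprised"
    DISGUSTED = "disgusted"

@dataclass
class EmotionClip:
    sentence: str
    audio_path: str
    intended_emotion: Emotion                     # chosen by the contributor when recording
    perceived_emotion: Optional[Emotion] = None   # filled in later by the validator
```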

What do you think about this feature request?

Tan

I love the idea! That said, I would be worried that “faking” an emotion is not the same as actually displaying that emotion in your voice. So what we would be doing is collecting a dataset of faked emotions, which I think would be less useful for machine learning.

Yes, there is a difference between acted speech and natural speech. Nevertheless, it would be a first step.

It is the same challenge for speech recognition: acted speech from reading sentences with a neutral emotion vs. natural speech with different emotions.

Having recordings with different emotions would probably also increase speech recognition accuracy.

The second step is to have natural speech. We could take free/public-domain podcasts with natural speech, automatically chop them into short utterances at silences, pass them through the speech recognition engine, and ask people to validate/correct the sentence and the emotion.
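
To make the chopping step concrete, here is a rough sketch using pydub’s silence splitting (the file names and thresholds are placeholders and would need tuning per podcast):

```python
# Split a public-domain podcast on silences into short clips for transcription/validation.
import os
from pydub import AudioSegment
from pydub.silence import split_on_silence

# Load and normalize to the 16 kHz mono format speech recognizers usually expect.
podcast = AudioSegment.from_file("episode.mp3").set_channels(1).set_frame_rate(16000)

chunks = split_on_silence(
    podcast,
    min_silence_len=700,                 # ms of silence that ends an utterance
    silence_thresh=podcast.dBFS - 16,    # 16 dB below the episode's average loudness
    keep_silence=200,                    # keep a little padding on each side
)

os.makedirs("clips", exist_ok=True)
for i, chunk in enumerate(chunks):
    if 1_000 < len(chunk) < 15_000:      # keep clips between 1 and 15 seconds
        chunk.export(f"clips/episode_{i:04d}.wav", format="wav")
```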

Tan

One of our goals for the year is to move away from read speech toward spontaneous speech. We have some interesting conceptual ideas around this, but probably won’t start work properly until Q4 this year.

If you know of any public domain (CC0) podcasts in any language, please link them here! I agree, these are a great source of voice data if we can align them properly.

@mhenretty Are there any updates on the plan to include spontaneous speech (as opposed to read speech) soon? I actually wonder how useful podcasts would be, since they tend to be scripted, or at the very least feature a lot more polished speech than the way normal people speak.

A suggestion for creating a crowdsourced corpus of spontaneous speech: ask volunteers to respond to open-ended questions, automatically transcribe the responses, then use volunteers to 1) correct and 2) validate the transcriptions. A similar approach could be used with public domain podcasts, but I’m not sure there are many of those out there…

I’d been wondering about spontaneous speech approaches myself, as I know my own voice has markedly different intonation when reading vs. improvising, and I’m sure it will impact recognition.

Something like @tinok suggests seems interesting. I wonder if having some kind of question-and-answer, quiz-like approach might work. You’d want the questions to be super easy (so everyone can respond). It might also be possible to boost the transcription quality, because the answers could be expected to come from a fairly narrow domain: e.g. “what type of pet did you have when you were young?” will yield a fairly narrow list of animals with a few standard openers (“we had a”, “I had a”, “in our family we had”, etc.), and you could feed any new responses into the list once validated (see the toy sketch below).
Effectively it would be a bit like the data used for Family Feud (Family Fortunes in the UK): https://en.m.wikipedia.org/wiki/Family_Feud
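
As a toy sketch of that narrow-domain idea (the answer list and the matching threshold are made up), a draft transcript could simply be matched against, and fed back into, the expected answers for a question:

```python
# Match draft transcripts against the expected answers for one question, and grow the
# list as new answers get validated. Purely illustrative; uses only the standard library.
import difflib

expected_answers = [
    "we had a dog", "i had a dog", "we had a cat",
    "i had a cat", "in our family we had a goldfish", "we had no pets",
]

def closest_expected(transcript, cutoff=0.6):
    """Return the best-matching known answer, or None if nothing is close enough."""
    matches = difflib.get_close_matches(transcript.lower(), expected_answers, n=1, cutoff=cutoff)
    return matches[0] if matches else None

def add_validated_answer(transcript):
    """Feed a human-validated response back into the list once it has been checked."""
    if transcript.lower() not in expected_answers:
        expected_answers.append(transcript.lower())

print(closest_expected("we had a dug"))   # -> "we had a dog"
```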

@nmstoker, I would suggest questions that require longer responses. Short-response questions are likely to elicit one-word or very short answers (‘a dog’, ‘we had no pets’, ‘a cat and a goldfish’). Examples of what I’m proposing would be:

  • Can you tell me in detail what you had for breakfast, lunch, and dinner yesterday?

  • What are your thoughts about the Brexit [or insert any number of current issues] debate?

  • Would you ever consider moving to another country? Why/why not?

  • What role do you think robots may play in your personal life in 10 years?

I can easily come up with 100 questions along these lines.

How could this work in terms of UX?

Record responses

  1. I suggest having a separate workflow where volunteers are asked to respond to questions. Alternatively, we could display questions among current phrases, with the explicit instruction to record a response rather than read the phrase.

  2. Randomly show users five questions to choose from. After choosing a question, start recording. Show a countdown from 30 seconds to indicate that the answer should be relatively short.

  3. After recording their answer, they could choose from three different questions, excluding questions already answered. Forcing them to answer all questions before seeing a new set (as is currently done for reading phrases) might lower participation: not all questions may be equally relevant, interesting, or understandable to volunteers.

  4. Nice to have: Add the questions themselves to the current database and play back recordings of the questions in this new feature instead of showing the question written on the screen. This would mirror a more natural conversation/interview style.

  5. [Responses to the questions should be automatically transcribed with DeepSpeech and saved to the database]
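
A minimal sketch of step 5, assuming the DeepSpeech Python bindings and an SQLite table as a stand-in for the real backend (model paths and table/column names are placeholders):

```python
# Transcribe a recorded answer with DeepSpeech and store the draft next to its question.
import sqlite3
import wave
import numpy as np
from deepspeech import Model

ds = Model("deepspeech-0.9.3-models.pbmm")               # placeholder model file
ds.enableExternalScorer("deepspeech-0.9.3-models.scorer")

def transcribe(wav_path):
    """Run DeepSpeech on a 16 kHz mono 16-bit PCM WAV file."""
    with wave.open(wav_path, "rb") as w:
        audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    return ds.stt(audio)

def save_response(db, question_id, wav_path):
    """Store the draft transcript with status 'pending' for later human validation."""
    draft = transcribe(wav_path)
    db.execute(
        "INSERT INTO responses (question_id, audio_path, transcript, status) "
        "VALUES (?, ?, ?, 'pending')",
        (question_id, wav_path, draft),
    )
    db.commit()
```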

Validate transcriptions

  1. Users can choose to validate transcriptions. They are then shown the original question as well as the text of the transcription and a play button. After clicking the play button the volunteer will hear the original recording.

  2. The user has three options: A) Validate the transcription, B) Correct the transcription, C) Mark it as inaudible. (A sketch of how these could map onto the stored record follows this list.)

  3. If choosing A or C, they will see the next question and transcription to be validated.

  4. If choosing B, the user is able to edit the transcription. After editing, she can save it and will be shown the next question.
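
A small sketch of how the three options might update the stored record (the record fields and status values are hypothetical, not the real backend):

```python
# Apply a validator's decision (options A/B/C above) to a stored response record.
from dataclasses import dataclass
from typing import Optional

@dataclass
class ResponseRecord:
    clip_id: str
    question_id: str
    transcript: str           # DeepSpeech draft, possibly corrected later
    status: str = "pending"   # pending | validated | inaudible

def handle_validation(record: ResponseRecord, action: str,
                      corrected_text: Optional[str] = None) -> ResponseRecord:
    if action == "A":                         # accept the transcription as-is
        record.status = "validated"
    elif action == "B":                       # save the edited transcription, then accept
        if not corrected_text:
            raise ValueError("option B requires the edited transcription")
        record.transcript = corrected_text
        record.status = "validated"
    elif action == "C":                       # unusable audio
        record.status = "inaudible"
    else:
        raise ValueError(f"unknown action: {action}")
    return record
```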

Obviously this would require some changes in terms of UI, features, and backend. I’m able to contribute more detailed scoping and help inform UX and UI options, but only limited coding.

I like the approach; I guess it hinges on how well DeepSpeech (trained on read speech) copes with spontaneous, open-ended speech.

If it copes reasonably well then the validation process can deal with the odd little mistranscription, but if it’s too far off there would be a risk of frustrating volunteers - even a word accuracy of around 90% (i.e. a WER of roughly 10%), combined with people being less “tidy” with their speech as they ad lib, will wear pretty quickly!
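
For reference, WER is the number of substituted, deleted, and inserted words divided by the length of the reference transcript; a quick sketch of how it could be measured on validated/corrected transcripts:

```python
# Word error rate via word-level edit distance:
# (substitutions + deletions + insertions) / reference length.
def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("we had a dog and a cat", "we had the dog and cat"))   # 2/7, about 0.29
```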

By coincidence, Amazon just published a paper discussing a technique to reduce error rates for narrow domains of the kind I was suggesting above.
