Voice recognition and emotional rescue

Or: Hitting the Right Chords in Speech Synthesis

The interaction between man and machine is increasingly shifting from 1970s-style fixed terminals, keyboards and displays to a wider range of sensors and output devices such as smartphones and connected speakers. It’s fair to say the human voice is becoming the most natural way to communicate. Sentiment and emotions make it easier to interpret ambiguous sentences; otherwise it would be hard to explain the rise of emojis. They let you add a little wink to the sentence “You’re dumb” to turn an insult into a term of endearment or a compliment.

The battle between emotions and logic

“I think, therefore I am.” The famous dictum by René Descartes (1596-1650) has influenced scientists and thinkers alike for centuries. He is one of the most prominent proponents of the dualism between thinking and feeling. It is not emotions that create human identity and intellect, he argued, but our capacity to reason. Feelings can safely be neglected, as if we were all Vulcans. Researchers, however, had been trying to correct this perception long before Star Trek characters like Mr. Spock and later Data entered the stage. They are finding mounting evidence that intellectual intelligence isn’t worth much without accompanying emotional intelligence. One of them is Antonio R. Damasio, professor of neuroscience, psychology and philosophy at the University of Southern California and the director of the USC Brain and Creativity Institute.

Descartes’ error, he argues, “consisted of fundamentally severing the mind from the body and assuming that thought happens separately from the body and is the actual substrate of the self.” Damasio postulates three theses to counter Cartesian dualism: [1]

  • reason depends on our ability to feel emotions;
  • feelings are perceptions of the physical landscape;
  • the body is the frame of reference for all neural processes.

Depending on the school of thought, experts distinguish between four or more major emotions. The “Big Four” are anger, sorrow, joy and fear. They are universal across all cultures and arose through evolution. Being able to feel and express those emotions helps us cope as individual members of a society and survive as a species.

The American psychologist Paul Ekman (born Feb. 15, 1934) pioneered research into emotions and their relationship with facial expressions. Ekman has created an “Atlas of Emotions” that maps out more than 10,000 facial expressions, establishing his reputation as the world’s best human lie detector. Based on his research on typical facial expressions, Ekman identified six universal emotions known as the “Ekman Big Six”: anger, joy, sorrow, disgust, surprise and fear.

What the world needs now is … emotions

Emotions, then, play a crucial part in communication and make it easier. Sometimes they’re the precondition for establishing communication at all. Words are more than just strings of letters, and human language is much more than clearly pronouncing several words and stringing them together.

Human verbal communication is shaped at least as much by the emotional expression in the speaker’s voice as by the words themselves. A second information channel is wrapped around our utterances to signal whether the message is important, whether it is meant seriously and whether the speaker is happy or sad. A message can only be partially understood without the context and sentiment of the speaker.

If we want to interact with artificial intelligence (AI) in a more natural and intuitive way, it’s therefore crucial that the AI is capable of understanding the contextual emotions of its counterpart as well as the semantic message itself. It also has to react appropriately, both in the content and in the type of its reaction. And its reaction has to adhere to learned conventions. If an angry customer calls a hotline, she expects understanding and a solution to her problem, not an AI that pokes fun at her.

Using emotional speech recognition and synthesis

One possible use case is the tiered routing of callers while they are on hold. The AI recognizes who’s angry and routes them to the human agents best suited or trained from a psychological point of view. When monitoring social media, it could also be helpful to recognize sentiments such as irony or sarcasm in order to understand that what sounds like praise is in fact harsh criticism. In a smart home environment, a computer could adjust the environment to the resident’s mood.
It would be a mistake, though, if machines suddenly displayed an emotional response, because humans wouldn’t be able to cope with the abrupt change. We’ve been trained to be brief and concise when dealing with a digital assistant, and nobody expects Siri or Alexa to understand irony or sarcasm. But it’s also true that these digital helpers already react in an emotionally appropriate way in some situations. If we berate a voice assistant, we’re often gently reprimanded and don’t get the search results that a more neutral verbal input would trigger.

The future’s emotional

It’s only a matter of time until this interaction evolves, as research develops options for emotional speech synthesis. That will make it possible, for instance, to adjust the urgency of automatic status announcements. Is the problem with a car minor, or should you pull over immediately? You’ll be able to tell how bad things are by listening to the announcement. Artificial agents would be more credible if they had emotional expressiveness in their voice while respecting social norms. “Special offers” could then be presented as something truly special, and automated apologies would really sound apologetic.
Emotional speech synthesis could open up a whole new world of opportunities in healthcare, too, for example helping aphasia patients who suffer from neurologically caused speech loss.
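
One way to picture adjustable urgency in a status announcement is through prosody markup. The sketch below is a minimal illustration, not any vendor’s implementation: it wraps a message in an SSML prosody element, whose rate, pitch and volume attributes are defined in the W3C SSML specification. The function name announce, the three urgency levels and the attribute values chosen for each level are assumptions made for this example.

```python
from xml.sax.saxutils import escape

# Hypothetical mapping from urgency level to SSML prosody settings.
# The attribute names and keyword values come from the SSML standard;
# which values suit which level is an illustrative guess.
PROSODY_BY_URGENCY = {
    "info":     {"rate": "medium", "pitch": "medium", "volume": "medium"},
    "warning":  {"rate": "medium", "pitch": "high",   "volume": "loud"},
    "critical": {"rate": "fast",   "pitch": "x-high", "volume": "x-loud"},
}

def announce(message: str, urgency: str = "info") -> str:
    """Wrap a status message in SSML whose prosody reflects its urgency."""
    settings = PROSODY_BY_URGENCY[urgency]
    attrs = " ".join(f'{name}="{value}"' for name, value in settings.items())
    # Many engines also expect version and xmlns attributes on <speak>;
    # they are omitted here for brevity.
    return f"<speak><prosody {attrs}>{escape(message)}</prosody></speak>"

print(announce("Tire pressure is slightly low.", "info"))
print(announce("Engine overheating. Pull over now.", "critical"))
```

The same text can thus be rendered calmly or urgently without changing the wording; how convincing the result sounds depends entirely on the synthesis engine that interprets the markup.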

Approaches to emotional speech synthesis

There are as many paths to emotional speech synthesis as there are algorithms. Broadly, though, we can distinguish four approaches:

  1. Articulatory synthesis: The acoustic specifics of the human speech apparatus are mathematically modeled. Emotional states of the speaker are represented directly through muscle tension.

  2. Formant synthesis: The human speech apparatus is modeled through digital filters. The challenge here lies in capturing the dynamics of speaking. A model for “emotionally neutral” speech has to be adapted so that it can simulate emotional excitement in the speech output.

  3. Data-based synthesis: Artificial speech is generated by stringing together existing speech samples and manipulating the output signal as little as possible to avoid artefacts. The emotional expressions therefore have to be part of the database and need to be labeled accordingly.

  4. Neural network-based synthesis: This type of synthesis is a special case of data-based synthesis. Here, too, the emotions and speaking style have to be labeled in the training data.
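
The labeling requirement in the data-based approaches can be made concrete with a toy example. The sketch below assumes a tiny, hypothetical sample database in which every recorded unit carries an emotion label; SpeechUnit, select_units and the file names are invented for illustration. Real systems also weigh how well neighboring samples join acoustically, which is left out here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SpeechUnit:
    text: str      # word or phrase the recording contains
    emotion: str   # label assigned when the database was annotated
    audio: str     # hypothetical file name of the recording

# Toy database: one phrase exists in two emotional variants, one only neutrally.
DATABASE = [
    SpeechUnit("sorry", "neutral", "sorry_neutral.wav"),
    SpeechUnit("sorry", "apologetic", "sorry_apologetic.wav"),
    SpeechUnit("for the delay", "neutral", "delay_neutral.wav"),
]

def select_units(phrases, emotion, database=DATABASE):
    """Pick one recording per phrase, preferring the requested emotion.

    Falls back to a neutral recording when no sample carries the requested
    label, and raises KeyError if the phrase was never recorded at all.
    """
    chosen = []
    for phrase in phrases:
        candidates = [u for u in database if u.text == phrase]
        if not candidates:
            raise KeyError(f"no recording for {phrase!r}")
        exact = next((u for u in candidates if u.emotion == emotion), None)
        neutral = next((u for u in candidates if u.emotion == "neutral"), candidates[0])
        chosen.append(exact if exact is not None else neutral)
    return chosen

for unit in select_units(["sorry", "for the delay"], "apologetic"):
    print(unit.audio)
# prints sorry_apologetic.wav, then delay_neutral.wav
```

The fallback shows why coverage matters: an emotion that was never recorded and labeled cannot be selected, so the output silently reverts to neutral speech.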

Beware of irony, really!

Human communication has some tricks up its sleeve that pose a special challenge to machines. We often say one thing and mean the exact opposite. Irony and its siblings sarcasm, cynicism and sardonicism aren’t always easy to spot; recognizing them has to be learned through acculturation. Children don’t initially understand that saying “what a nice day!” with a raised eyebrow means the opposite, and that screaming “awesome muffin!” doesn’t mean we’re praising an outstanding piece of pastry. Children, in other words, do not understand the tonality of what’s said and instead take words at face value.
The same is true for machines, which can lead an AI to seriously misinterpret the content of a message. In cooperation with audEERING, Deutsche Telekom has developed a technology demonstrator to tackle this problem of understanding. The demonstrator records a speech sample, analyzes it and outputs values for

  • vocal excitement (arousal)
  • vocal well-being (valence)
  • textual sentiment: positive, negative or neutral

If the values for vocal well-being and textual sentiment don’t match, an irony alarm bell goes off. If the system recognizes anger, it sounds an anger alarm. In short, the AI can assess verbal communication and react accordingly. Processing emotions will revolutionize the interaction between man and machine. Taking nonverbal expression into account is a natural precondition for making it easier for humans and automated systems to converse with each other.
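
The alarm logic described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the demonstrator’s actual code: the article does not say how the values are scaled, so the sketch assumes scores between -1 and 1 for vocal excitement and vocal well-being. The thresholds, the anger rule and the names Analysis and alarms are invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Analysis:
    excitement: float      # vocal excitement, assumed range -1..1
    well_being: float      # vocal well-being, assumed range -1..1
    text_sentiment: str    # "positive", "negative" or "neutral"

def alarms(a: Analysis, margin: float = 0.3, anger_level: float = 0.5):
    """Return the list of alarms raised for one analyzed speech sample."""
    raised = []
    # Irony: the words and the voice point in opposite directions.
    if a.text_sentiment == "positive" and a.well_being < -margin:
        raised.append("irony")
    elif a.text_sentiment == "negative" and a.well_being > margin:
        raised.append("irony")
    # Assumption: anger is approximated as high excitement combined with
    # low well-being. The article does not describe the real detection rule.
    if a.excitement > anger_level and a.well_being < -margin:
        raised.append("anger")
    return raised

# Friendly words spoken in an unhappy voice.
print(alarms(Analysis(excitement=0.2, well_being=-0.6, text_sentiment="positive")))  # ['irony']
# Negative words spoken in an agitated, unhappy voice.
print(alarms(Analysis(excitement=0.8, well_being=-0.7, text_sentiment="negative")))  # ['anger']
```

The margin keeps small disagreements between voice and text from triggering an alarm; where exactly to set it would have to be determined from real recordings.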



[1] Spektrum.de: Descartes' Irrtum. Fühlen, Denken und das menschliche Gehirn. 1997-05-01 (accessed 2017-08-23).
