🎛️ EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control

Haozhe Chen, Run Chen, and Julia Hirschberg

Columbia University

Paper | GitHub | Demo

[Figure: EmoKnob teaser]

Abstract

While recent advances in Text-to-Speech (TTS) technology produce natural and expressive speech, they lack the option for users to select emotion and control intensity. We propose EmoKnob, a framework that allows fine-grained emotion control in speech synthesis with few-shot demonstrative samples of arbitrary emotion. Our framework leverages the expressive speaker representation space made possible by recent advances in foundation voice cloning models. Based on the few-shot capability of our emotion control framework, we propose two methods to apply emotion control on emotions described by open-ended text, enabling an intuitive interface for controlling a diverse array of nuanced emotions. To facilitate a more systematic emotional speech synthesis field, we introduce a set of evaluation metrics designed to rigorously assess the faithfulness and recognizability of emotion control frameworks. Through objective and subjective evaluations, we show that our emotion control framework effectively embeds emotions into speech and surpasses emotion expressiveness of commercial TTS services.

Method

[Figure: EmoKnob method overview]
  1. Emotion Direction Extraction: We extract an emotion direction vector from a few demonstrative emotional speech samples and their neutral counterparts.
  2. Apply Emotion Control: We apply the extracted emotion direction to a target speech with controllable strength, allowing fine-grained emotion adjustment.
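
As a rough sketch of these two steps (not the paper's exact implementation), assume the foundation voice cloning model exposes fixed-size speaker embeddings as NumPy vectors: the emotion direction is the normalized mean difference over the few-shot pairs, and control is a scaled addition to the target speaker embedding.

```python
import numpy as np

def extract_emotion_direction(emotional_embs, neutral_embs):
    """Average the per-pair differences between emotional and neutral
    speaker embeddings, then normalize to unit length."""
    diffs = [e - n for e, n in zip(emotional_embs, neutral_embs)]
    direction = np.mean(diffs, axis=0)
    return direction / np.linalg.norm(direction)

def apply_emotion(speaker_emb, direction, strength):
    """Move the target speaker embedding along the emotion direction.
    `strength` is the knob value shown in parentheses on this page
    (e.g. 0.1-0.5); larger values give a stronger emotion."""
    controlled = speaker_emb + strength * direction
    # Rescale so the edited embedding keeps the original norm, an
    # assumption of this sketch about what the cloning model expects.
    return controlled / np.linalg.norm(controlled) * np.linalg.norm(speaker_emb)
```

Under these assumptions, `apply_emotion(clone_emb, angry_direction, 0.3)` would correspond to the "Angry (0.3)" column in the examples below.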
[Figure: Retrieval-based emotion control]

Because our emotion control framework needs only a few demonstrative samples, we can build systems that apply emotion control based on open-ended text descriptions of emotions. We developed two such systems (a minimal sketch of the retrieval step follows the list):

  1. Retrieval-based Emotion Control: This system matches the input text description against the transcripts of a pre-defined set of emotional samples and retrieves the most similar ones.
  2. Synthetic-data-based Emotion Control: This system uses existing TTS models to synthesize emotional speech from emotional text, and then uses those synthetic samples to extract the emotion direction.
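
The sketch below illustrates one way the retrieval step could work; it is an assumption-laden illustration, not the paper's implementation. It embeds the open-ended emotion description and the transcripts of a hypothetical pre-defined sample set with an off-the-shelf text encoder (`sentence-transformers` here, purely for illustration) and returns the closest emotional/neutral pairs.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Hypothetical pre-defined set: each entry holds a transcript plus paths
# to an emotional clip and its neutral counterpart (paths are placeholders).
SAMPLE_SET = [
    {"transcript": "I can't believe you did this to me!",
     "emotional": "a_emotional.wav", "neutral": "a_neutral.wav"},
    {"transcript": "Thank you so much, this means the world.",
     "emotional": "b_emotional.wav", "neutral": "b_neutral.wav"},
]

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def retrieve_samples(description, k=5):
    """Return the k samples whose transcripts are most similar to the
    open-ended emotion description, by cosine similarity."""
    query = encoder.encode([description], normalize_embeddings=True)[0]
    transcripts = [s["transcript"] for s in SAMPLE_SET]
    embs = encoder.encode(transcripts, normalize_embeddings=True)
    scores = embs @ query  # cosine similarity (embeddings are unit norm)
    top = np.argsort(-scores)[:k]
    return [SAMPLE_SET[i] for i in top]
```

The retrieved emotional/neutral pairs would then feed the few-shot emotion direction extraction sketched in the Method section.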

Example Clips

All voice clones and emotion-enhanced voice clones on this page are obtained by cloning this clip.

Emotion Enhancement (Simple Emotions)

This section shows examples of enhancing emotional text with fine-grained control of the corresponding simple emotion. The number in parentheses is the strength of the emotion control. All emotion direction vectors are obtained single-shot (one pair of emotional and neutral clips).

[Audio table: Original Clone vs. Angry at strengths 0.1-0.5, Sentences 1-4]
[Audio table: Original Clone vs. Happy at strengths 0.1-0.5, Sentences 1-4]
[Audio table: Original Clone vs. Sad at strengths 0.1-0.5, Sentences 1-4]
[Audio table: Original Clone vs. Surprise at strengths 0.1-0.5, Sentences 1-4]
[Audio table: Original Clone vs. Contempt at strengths 0.1-0.5, Sentences 1-4]
[Audio table: Original Clone vs. Disgust at strengths 0.1-0.5, Sentences 1-4]

Emotion Selection (Simple Emotions)

This section shows examples of selecting an emotion for neutral text, using simple emotions. While in usual TTS frameworks the emotion of the speech is decided entirely by the text, our framework lets the user select the emotion independently. All emotion direction vectors are obtained single-shot (one pair of emotional and neutral clips).

[Audio table: columns Original, Happy, Sad, Surprise, Contempt, Disgust; rows Sentence 1-4]

Emotion Enhancement (Complex Emotions)

This section shows examples of enhancing emotional text with fine-grained control of the corresponding emotion, using more complex emotions. The number in parentheses is the strength of the emotion control. All emotion direction vectors are obtained two-shot (two pairs of emotional and neutral clips).

[Audio table: Original Clone vs. Empathy at strengths 0.1-0.6, Sentences 1-4]
[Audio table: Original Clone vs. Charisma at strengths 0.1-0.6, Sentences 1-4]

Emotion Selection (Complex Emotions)

This section shows examples of selecting an emotion for neutral text, using complex emotions. While in usual TTS frameworks the emotion of the speech is decided entirely by the text, our framework lets the user select the emotion independently. All emotion direction vectors are obtained two-shot (two pairs of emotional and neutral clips).

[Audio table: columns Original, Empathy, Charisma; rows Sentence 1-4]

Emotion Enhancement (Open-Ended Text Emotion Description)

This section shows examples of enhancing emotional text with fine-grained emotion control driven by an open-ended text description of the emotion, using our synthetic-data-based and retrieval-based methods. Previously, without the availability of large datasets for these emotions, it was not possible to apply emotion control to them; our open-ended text-to-emotion framework based on retrieval and synthetic data makes this possible. The number in parentheses is the strength of the emotion control. All emotion direction vectors are obtained five-shot (five pairs of emotional and neutral clips).
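
For emotions like these with no dedicated datasets, the synthetic-data route can be sketched roughly as follows. The `tts(text)` and `speaker_embedding(audio)` callables are hypothetical names standing in for an existing expressive TTS model and the voice cloning model's speaker encoder; they are assumptions of this sketch, not the paper's actual API.

```python
import numpy as np

def synthetic_direction(emotional_texts, neutral_texts, tts, speaker_embedding):
    """Build an emotion direction from synthetic speech only.

    `tts(text)` synthesizes speech with an existing expressive TTS model;
    `speaker_embedding(audio)` returns the cloning model's speaker embedding.
    Both are hypothetical placeholders for this illustration.
    """
    emotional_embs = [speaker_embedding(tts(t)) for t in emotional_texts]
    neutral_embs = [speaker_embedding(tts(t)) for t in neutral_texts]
    # Same few-shot extraction as in the Method sketch: normalized mean
    # difference between emotional and neutral embeddings.
    diffs = [e - n for e, n in zip(emotional_embs, neutral_embs)]
    direction = np.mean(diffs, axis=0)
    return direction / np.linalg.norm(direction)
```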

[Audio table: Original Clone vs. Romance at strengths 0.1-0.6, Sentences 1-4]
[Audio table: Original Clone vs. Desire at strengths 0.1-0.6, Sentences 1-4]
[Audio table: Original Clone vs. Sarcasm at strengths 0.1-0.6, Sentences 1-4]
[Audio table: Original Clone vs. Envy at strengths 0.1-0.6, Sentences 1-4]
[Audio table: Original Clone vs. "Curious, Intrigued" at strengths 0.1-0.6, Sentences 1-4]
[Audio table: Original Clone vs. Blaming at strengths 0.1-0.6, Sentences 1-4]
[Audio table: Original Clone vs. "Grateful, Appreciative, Thankful, Indebted, Blessed" at strengths 0.1-0.6, Sentences 1-4]
[Audio table: Original Clone vs. "Desire and Excitement" at strengths 0.1-0.6, Sentences 1-4]

Emotion Selection (Open-Ended Text Emotion Description)

This section shows examples of selecting an emotion for neutral text, using open-ended text descriptions of emotions. While in usual TTS frameworks the emotion of the speech is decided entirely by the text, our framework lets the user select the emotion independently. All emotion direction vectors are obtained five-shot (five pairs of emotional and neutral clips).

[Audio table: columns Original, Romance, Desire, Sarcasm, Envy, "Curious, Intrigued", Blaming, "Grateful, Appreciative, Thankful, Indebted, Blessed", "Desire and Excitement"; rows Sentence 1-4]

Application: Make a Non-Empathetic Speaker Empathetic

In this section, we show how EmoKnob can make a non-empathetic speaker sound empathetic. The emotion direction is obtained single-shot (one pair of empathetic and neutral clips).

[Audio clips: Original Speaker (contains vulgar language), Original Clone, Empathetic 1, Empathetic 2]

Citation

If you find this work useful, please consider citing our paper:

@misc{chen2024emoknobenhancevoicecloning,
      title={EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control},
      author={Haozhe Chen and Run Chen and Julia Hirschberg},
      year={2024},
      eprint={2410.00316},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.00316},
}