While recent advances in Text-to-Speech (TTS) technology produce natural and expressive speech, they offer users no option to select an emotion or control its intensity. We propose EmoKnob, a framework that allows fine-grained emotion control in speech synthesis with few-shot demonstrative samples of arbitrary emotion. Our framework leverages the expressive speaker representation space made possible by recent advances in foundation voice cloning models. Building on the few-shot capability of our emotion control framework, we propose two methods to apply emotion control to emotions described by open-ended text, enabling an intuitive interface for controlling a diverse array of nuanced emotions. To foster a more systematic emotional speech synthesis field, we introduce a set of evaluation metrics designed to rigorously assess the faithfulness and recognizability of emotion control frameworks. Through objective and subjective evaluations, we show that our emotion control framework effectively embeds emotions into speech and surpasses the emotional expressiveness of commercial TTS services.
Because our framework applies emotion control from only a few samples, we can build systems that control emotion based on open-ended text descriptions. We developed two such systems:
All voice-clone samples on this page, with and without emotion-control enhancement, are obtained by cloning this clip.
This section shows examples of enhancing emotional text with fine-grained control of the corresponding emotion, for simple emotions. The number in parentheses is the strength of the emotion control. All emotion direction vectors are obtained single-shot (one pair of emotional and neutral clips).
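To make the single-shot setup concrete, here is a minimal sketch of the idea in Python with NumPy. Everything here is illustrative: `embed_speaker` is a hypothetical stand-in for a foundation voice cloning model's speaker encoder, and the exact normalization and arithmetic in the paper may differ.

```python
import numpy as np

def emotion_direction(embed_speaker, emotional_clip, neutral_clip):
    """Single-shot: subtract the neutral clip's speaker embedding from the
    emotional clip's, and normalize so the strength knob has a stable scale."""
    direction = embed_speaker(emotional_clip) - embed_speaker(neutral_clip)
    return direction / np.linalg.norm(direction)

def apply_emotion(speaker_embedding, direction, strength):
    """Nudge a cloned speaker's embedding along the emotion direction.
    `strength` corresponds to the number in parentheses next to each sample."""
    return speaker_embedding + strength * direction
```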
This section shows examples of selecting simple emotions for a neutral text. While in usual TTS frameworks the emotion of the speech is decided entirely by the text, EmoKnob lets the user select the emotion directly. All emotion direction vectors are obtained single-shot (one pair of emotional and neutral clips).
This section shows examples of enhancing emotional text with fine-grained control of the corresponding emotion, for more complex emotions. The number in parentheses is the strength of the emotion control. All emotion direction vectors are obtained two-shot (two pairs of emotional and neutral clips).
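For the two-shot (and later five-shot) settings, one natural way to combine multiple pairs, again only a sketch under the same hypothetical `embed_speaker` as above and not necessarily the paper's exact aggregation, is to average the per-pair directions and renormalize:

```python
def few_shot_direction(embed_speaker, pairs):
    """`pairs` is a list of (emotional_clip, neutral_clip) tuples from the
    same speaker; average the per-pair unit directions, then renormalize."""
    directions = [emotion_direction(embed_speaker, emo, neu)
                  for emo, neu in pairs]
    mean_direction = np.mean(directions, axis=0)
    return mean_direction / np.linalg.norm(mean_direction)
```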
This section shows examples of selecting complex emotions for a neutral text. While in usual TTS frameworks the emotion of the speech is decided entirely by the text, EmoKnob lets the user select the emotion directly. All emotion direction vectors are obtained two-shot (two pairs of emotional and neutral clips).
This section shows examples of enhancing emotional text with fine-grained emotion control based on an open-ended text description of the emotion, using our synthetic-data-based and retrieval-based methods. Previously, without large datasets available for these emotions, it was not possible to apply emotion control for them; our open-ended text-to-emotion framework based on retrieval and synthetic data makes such control possible. The number in parentheses is the strength of the emotion control. All emotion direction vectors are obtained five-shot (five pairs of emotional and neutral clips).
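As a rough illustration of the retrieval-based method, the sketch below embeds the open-ended description, retrieves the k best-matching emotional/neutral clip pairs from a corpus by cosine similarity, and reuses `few_shot_direction` from above. The corpus layout, `embed_text`, and top-k cosine retrieval are all assumptions for illustration, not the paper's exact pipeline.

```python
def direction_from_description(embed_speaker, embed_text, description,
                               corpus, k=5):
    """`corpus`: list of dicts, each with a text "description" and an
    (emotional_clip, neutral_clip) pair recorded by the same speaker."""
    query = embed_text(description)

    def cosine(entry):
        doc = embed_text(entry["description"])
        return np.dot(query, doc) / (np.linalg.norm(query) * np.linalg.norm(doc))

    top_k = sorted(corpus, key=cosine, reverse=True)[:k]
    pairs = [(e["emotional_clip"], e["neutral_clip"]) for e in top_k]
    return few_shot_direction(embed_speaker, pairs)
```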
This section shows examples of selecting emotions for a neutral text using open-ended text descriptions of emotions. While in usual TTS frameworks the emotion of the speech is decided entirely by the text, EmoKnob lets the user select the emotion directly. All emotion direction vectors are obtained five-shot (five pairs of emotional and neutral clips).
In this section, we show how EmoKnob can make a non-empathetic speaker sound empathetic. The emotion direction is obtained single-shot (one pair of empathetic and neutral clips).
If you find this work useful, please consider citing our paper:
@misc{chen2024emoknobenhancevoicecloning,
  title={EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control},
  author={Haozhe Chen and Run Chen and Julia Hirschberg},
  year={2024},
  eprint={2410.00316},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2410.00316},
}