Description
We released a suite of models for emotion detection in voices and faces. Empathic Insight Voice predicts 54 scores that describe a human voice. It surpasses the Hume API and Gemini 2.5 on EmoNet-Voice, our new benchmark annotated by psychology experts. I think it would be cool to condition text-to-speech models on the 54 scores to give fine-grained control over emotions and other properties like harsh/soft, warm/cold, calm/aroused, ...
It might also be very nice to use our BUD-E Whisper model to condition text-to-speech on free-form captions that describe the emotions.
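To make the score-conditioning idea a bit more concrete, here is a minimal sketch of how a 54-dimensional conditioning vector could be assembled and attached to text embeddings. Everything in it is an assumption for illustration only: the function names `build_condition` and `attach_condition` are hypothetical, the scores are assumed to live in a fixed range (0 to 1 here, which may not match the real model outputs), the override indices are arbitrary, and plain concatenation stands in for whatever conditioning mechanism a given text-to-speech model actually supports.

```python
from typing import Dict, List, Optional, Sequence

NUM_SCORES = 54  # number of scores the post reports for Empathic Insight Voice


def build_condition(
    reference_scores: Sequence[float],
    overrides: Optional[Dict[int, float]] = None,
    low: float = 0.0,   # assumed lower bound of the score range
    high: float = 1.0,  # assumed upper bound of the score range
) -> List[float]:
    """Start from scores predicted on a reference clip, apply user
    overrides for selected dimensions, and clamp to [low, high]."""
    if len(reference_scores) != NUM_SCORES:
        raise ValueError(f"expected {NUM_SCORES} scores, got {len(reference_scores)}")
    condition = [float(s) for s in reference_scores]
    for index, value in (overrides or {}).items():
        if not 0 <= index < NUM_SCORES:
            raise IndexError(f"score index {index} out of range")
        condition[index] = float(value)
    return [min(max(v, low), high) for v in condition]


def attach_condition(
    text_embeddings: Sequence[Sequence[float]],
    condition: Sequence[float],
) -> List[List[float]]:
    """Concatenate the conditioning vector onto every token embedding
    (a simple stand-in for a real conditioning mechanism)."""
    return [list(token) + list(condition) for token in text_embeddings]


# Hypothetical usage: neutral reference scores, with two dimensions overridden.
reference = [0.5] * NUM_SCORES
condition = build_condition(reference, overrides={3: 0.9, 10: 1.7})
tokens = [[0.1, 0.2], [0.3, 0.4]]  # toy 2-dim text embeddings
conditioned = attach_condition(tokens, condition)
```

In a real system the vector would more likely go through a learned projection or cross-attention layer than raw concatenation. The point of the sketch is only that each of the 54 scores becomes an individually adjustable control.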
https://x.com/laion_ai/status/1935792645143494926
https://laion.ai/blog/do-they-see-what-we-see/
https://huggingface.co/laion/Empathic-Insight-Face-Small
https://huggingface.co/laion/BUD-E-Whisper
Read the Papers
EmoNet Face: https://arxiv.org/abs/2505.20033
EmoNet Voice: https://arxiv.org/abs/2506.09827