The Speech Emotion Recognition Crowdsourcing Project

SpeechEmotionRecognition.xyz

This website invites actors and native speakers to contribute to research on spoken emotion by recording short excerpts of emotional speech. We have selected phrases from theatrical plays whose emotional content is ambiguous, meaning that the same phrase can be spoken in different emotional contexts. Because the intended emotion is often difficult to perceive correctly, we ask visitors to select an emotion from the list before recording. The goal is to gather a significant number of recordings from different speakers. Every recording will be anonymized. Anyone who records can contact us to have their name added to the contributors' list. We will make the dataset publicly available so that the whole academic world (computer scientists, artists, psychologists, doctors) can benefit from this project.

Why should I contribute?

Besides helping scientific research, it is also a fun activity, according to our first contributors!
Actors can use it as a theatrical exercise, which is why we have carefully chosen excerpts from theatrical plays. Of course, you can always ignore our proposed phrases and just improvise! Furthermore, by downloading your recordings and having your name on the online contributors' list, you can enrich your artistic portfolio.
If you are an acting educator, you can suggest it to your team. Besides submitting the recordings to our server, it is possible to download them locally for your own use.

Speech Emotion Recognition

Speech Emotion Recognition (SER) is the process of extracting emotional paralinguistic information from speech. It is a field of growing interest, with potential applications in Human-Computer Interaction, content management, social interaction, and as an add-on module in Speech Recognition and Speech-To-Text systems. Text-independent automated SER relies on the attributes of the speech audio signal itself rather than on the words spoken. For a task as elusive as SER, a data-driven approach is typically followed: models are trained on labeled recordings. As a consequence, the performance of SER models is inextricably linked to the quality and organization of the dataset provided.
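To illustrate what "trained on data" means here, the following is a minimal sketch of a data-driven classifier. It is not the model used in this project: the nearest-centroid method, the function names `train_centroids` and `predict`, and the two-dimensional feature vectors are all assumptions chosen for brevity, and the numbers are made up rather than measured from real recordings.

```python
from collections import defaultdict
from math import dist


def train_centroids(samples):
    """Average the feature vectors of each emotion label.

    samples: list of (feature_vector, emotion_label) pairs.
    Returns a dict mapping each label to its mean feature vector.
    """
    grouped = defaultdict(list)
    for features, label in samples:
        grouped[label].append(features)
    return {
        label: [sum(column) / len(column) for column in zip(*vectors)]
        for label, vectors in grouped.items()
    }


def predict(centroids, features):
    """Return the label whose centroid is closest (Euclidean) to features."""
    return min(centroids, key=lambda label: dist(centroids[label], features))


# Hypothetical, made-up feature vectors standing in for audio attributes.
training = [
    ([0.9, 0.8], "anger"),
    ([0.8, 0.9], "anger"),
    ([0.2, 0.1], "sadness"),
    ([0.1, 0.2], "sadness"),
]

centroids = train_centroids(training)
prediction = predict(centroids, [0.7, 0.7])
print(prediction)
```

Even in this toy setting the prediction depends entirely on the labeled examples, which is why the quality and variety of the recordings matter so much.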

Acted Emotional Speech Dynamic Database - AESDD

Databases of emotional speech fall into two main categories: those that contain utterances of acted emotional speech and those that contain spontaneous emotional speech. Both categories have benefits and limitations. The Acted Emotional Speech Dynamic Database (AESDD) contains utterances of acted emotional speech in the Greek language. The motivation for creating the database was the absence of a publicly available, high-quality database for SER in Greek, a gap that became apparent during research on an emotion-triggered lighting framework for theatrical performance [1]. The database contains utterances expressing five emotions: anger, disgust, fear, happiness, and sadness. The first version of the database was created in collaboration with a group of professional actors, who showed keen interest in the proposed framework. "Dynamic" in AESDD refers to the intention of constantly expanding the database through contributions from actors and performers who are involved or interested in the project. Although the call for contributions is addressed to actors, the SER models trained on the AESDD are not exclusively performance-oriented. The first version of the AESDD was presented in [2]. In [4], subjective evaluation experiments were carried out on the database to assess how accurately human listeners recognize the intended emotion in AESDD utterances. The accuracy of human listeners was estimated at around 74%.
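As a rough illustration of how listener accuracy can be computed in an evaluation of this kind, the sketch below compares the emotion each speaker intended with the emotion a listener perceived. The function name `listener_accuracy` and the response pairs are hypothetical; they are not data from [4] and do not reproduce the procedure used there.

```python
from collections import Counter

# The five emotion labels covered by the AESDD.
EMOTIONS = ("anger", "disgust", "fear", "happiness", "sadness")


def listener_accuracy(responses):
    """Compute overall and per-emotion recognition accuracy.

    responses: iterable of (intended, perceived) emotion label pairs.
    Returns (overall_accuracy, {emotion: accuracy}) where the dict only
    includes emotions that appear at least once as the intended label.
    """
    total = Counter()
    correct = Counter()
    for intended, perceived in responses:
        if intended not in EMOTIONS or perceived not in EMOTIONS:
            raise ValueError(f"unknown emotion label: {intended!r}, {perceived!r}")
        total[intended] += 1
        if intended == perceived:
            correct[intended] += 1

    count = sum(total.values())
    if count == 0:
        raise ValueError("no responses given")

    overall = sum(correct.values()) / count
    per_emotion = {e: correct[e] / total[e] for e in EMOTIONS if total[e]}
    return overall, per_emotion


# Made-up listener responses, for illustration only.
responses = [
    ("anger", "anger"),
    ("fear", "sadness"),
    ("happiness", "happiness"),
    ("sadness", "sadness"),
]

overall, per_emotion = listener_accuracy(responses)
print(overall)
print(per_emotion)
```

The per-emotion breakdown shows which emotions listeners confuse most often, something a single overall figure hides.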

Publications

[1] Vryzas, N., Liatsou, A., Kotsakis, R., Dimoulas, C., & Kalliris, G. (2017, August). Augmenting Drama: A Speech Emotion-Controlled Stage Lighting Framework. In Proceedings of the 12th International Audio Mostly Conference on Augmented and Participatory Sound and Music Experiences (p. 8). ACM.
[2] Vryzas, N., Kotsakis, R., Liatsou, A., Dimoulas, C. A., & Kalliris, G. (2018). Speech emotion recognition for performance interaction. Journal of the Audio Engineering Society, 66(6), 457-467.
[3] Vryzas, N., Vrysis, L., Kotsakis, R., & Dimoulas, C. (2018, September). Speech emotion recognition adapted to multimodal semantic repositories. In 2018 13th International Workshop on Semantic and Social Media Adaptation and Personalization (SMAP) (pp. 31-35). IEEE.
[4] Vryzas, N., Matsiola, M., Kotsakis, R., Dimoulas, C., & Kalliris, G. (2018, September). Subjective Evaluation of a Speech Emotion Recognition Interaction Framework. In Proceedings of the Audio Mostly 2018 on Sound in Immersion and Emotion (p. 34). ACM.
[5] Vryzas, N., Vrysis, L., Matsiola, M., Kotsakis, R., Dimoulas, C., & Kalliris, G. (2020). Continuous Speech Emotion Recognition with Convolutional Neural Networks. Journal of the Audio Engineering Society, 68(1/2), 14-24.