Automatic Emotion Recognition From Multimodal Information Fusion Using Deep Learning Approaches
In recent years, advances in computational and information systems have contributed to the growth of research areas such as affective computing, which aims to identify the emotional states of humans in order to develop different interaction and computational systems. To this end, emotions have been characterized through specific kinds of data, including audio, facial expressions, and physiological signals, among others. However, the natural response to a single emotional event suggests a correlation between different modalities when expression reaches its peak. This suggests that processing multiple data modalities (multimodal information fusion) could provide more learning patterns for emotion recognition. Meanwhile, Deep Learning strategies have gained interest in the research community since 2012, as they are adaptive models that have shown promising results in the analysis of many kinds of data, such as images, signals, and other temporal data. This work aims to determine whether information fusion using Deep Neural Network architectures improves emotion recognition compared with unimodal models. Thus, a new information fusion model based on Deep Neural Network architectures is proposed to recognize emotional states from audio-visual information. The proposal takes advantage of the adaptiveness of Deep Learning models to extract deep features according to the input data type. The proposed approach was developed in three stages. In the first stage, characterization and preprocessing algorithms (INTERSPEECH 2010 Paralinguistic Challenge features for audio and Viola-Jones face detection for video) were used for dimensionality reduction and extraction of the main information from the raw data. Then, two unimodal models were developed to process audio and video separately.
These models were then used to develop two information fusion strategies: a decision-fusion model and a feature-fusion model. All models were evaluated on the eNTERFACE database, a well-known public audiovisual emotion dataset, which allows results to be compared with state-of-the-art methods. Experimental results showed that the Deep Learning approaches that fused audio and visual information outperformed the unimodal strategies.
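The distinction between the two fusion strategies can be sketched with a minimal NumPy example. This is not the thesis's actual architecture; the feature dimensions, random linear classifiers, and six-class output (eNTERFACE covers the six basic emotions) are illustrative assumptions, standing in for the trained deep branches.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical deep features from each unimodal branch (dimensions are
# illustrative, not taken from the thesis).
audio_feat = rng.normal(size=(1, 64))
video_feat = rng.normal(size=(1, 128))

def softmax(z):
    # Numerically stable softmax over the class axis.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Stand-in linear classifiers over 6 emotion classes, one per modality.
W_audio = rng.normal(size=(64, 6))
W_video = rng.normal(size=(128, 6))

# Decision fusion: each branch classifies on its own, then the class
# posteriors are combined (here a simple average; weighted schemes also exist).
p_audio = softmax(audio_feat @ W_audio)
p_video = softmax(video_feat @ W_video)
p_decision = (p_audio + p_video) / 2

# Feature fusion: concatenate the deep features from both branches and
# classify the joint representation with a single classifier.
W_joint = rng.normal(size=(64 + 128, 6))
p_feature = softmax(np.concatenate([audio_feat, video_feat], axis=1) @ W_joint)

print("decision-fusion class:", int(p_decision.argmax(axis=1)[0]))
print("feature-fusion class:", int(p_feature.argmax(axis=1)[0]))
```

The design trade-off the sketch exposes: decision fusion keeps the modalities independent until the final vote, while feature fusion lets the joint classifier learn cross-modal interactions at the cost of a larger input space.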