Multimodal Emotion Recognition Fusion Analysis Adapting BERT With Heterogeneous Feature Unification

Sanghyun Lee; David K Han; Hanseok Ko

doi:10.1109/ACCESS.2021.3092735

Journal article

Open access

Peer reviewed

IEEE Access, vol. 9, pp. 94557-94572

2021

DOI: https://doi.org/10.1109/ACCESS.2021.3092735

Files and links (2)

url: https://doi.org/10.1109/access.2021.3092735
Published, Version of Record (VoR), CC BY 4.0, Open

url: https://doi.org/10.1109/ACCESS.2021.3092735
Published, Version of Record (VoR), Open

Keywords: attention based multimodal; BERT; Bit error rate; Computer architecture; Deep learning; Emotion recognition; Feature extraction; heterogeneous features; Multimodal emotion recognition; Sentiment analysis; transformer; Visualization

Abstract

Human communication is rich in emotional content, so multimodal emotion recognition plays an important role in communication between humans and computers. Because a speaker's emotional characteristics are complex, emotion recognition remains a challenge, particularly in capturing emotional cues across modalities such as speech, facial expressions, and language. Audio and visual cues are particularly vital for a human observer in understanding emotions. However, most previous work on emotion recognition has been based solely on linguistic information, which can overlook various forms of nonverbal information. In this paper, we present a new multimodal emotion recognition approach that improves the BERT model for emotion recognition by combining it with heterogeneous features based on language, audio, and visual modalities. Specifically, we adapt the BERT model to accommodate the heterogeneous features of the audio and visual modalities. We introduce the Self-Multi-Attention Fusion module, the Multi-Attention Fusion module, and the Video Fusion module, which are attention-based multimodal fusion mechanisms built on the recently proposed transformer architecture. We explore the optimal ways to combine fine-grained representations of audio and visual features into a common embedding while combining a pre-trained BERT model with these modalities for fine-tuning. In our experiments, we evaluate on the commonly used CMU-MOSI, CMU-MOSEI, and IEMOCAP datasets for multimodal sentiment analysis. Ablation analysis indicates that the audio and visual components make a significant contribution to the recognition results, suggesting that these modalities contain highly complementary information for sentiment analysis based on video input. Our method achieves state-of-the-art performance on the CMU-MOSI, CMU-MOSEI, and IEMOCAP datasets.
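The attention-based fusion idea in the abstract can be illustrated with a minimal sketch. This is not the paper's Self-Multi-Attention Fusion, Multi-Attention Fusion, or Video Fusion module; it is a generic scaled dot-product cross-attention in plain Python, in which text token embeddings act as queries over audio and visual frames. The function names (`cross_attention`, `fuse`), the additive residual fusion, and the toy vectors are all assumptions made for illustration. The sketch also assumes the three modalities have already been projected to a common embedding size, which a real model would do with learned layers.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    # Scaled dot-product attention: each query row attends over all key rows
    # and returns the attention-weighted average of the value rows.
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def fuse(text, audio, visual):
    # Hypothetical fusion rule (an assumption, not the paper's design):
    # text tokens query the audio and visual frames, and the attended
    # features are added to the text embedding as a residual.
    attended_audio = cross_attention(text, audio, audio)
    attended_visual = cross_attention(text, visual, visual)
    return [[t + a + v for t, a, v in zip(t_row, a_row, v_row)]
            for t_row, a_row, v_row in zip(text, attended_audio, attended_visual)]

# Toy inputs: 2 text tokens, 3 audio frames, 2 visual frames, embedding size 2.
text = [[1.0, 0.0], [0.0, 1.0]]
audio = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
visual = [[2.0, 0.0], [0.0, 0.0]]
fused = fuse(text, audio, visual)
```

The fused output keeps one row per text token, so under these assumptions it could be passed on to a text encoder in place of the original token embeddings; the audio and visual sequences may have different lengths because attention averages over them.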

Metrics

57 Record Views

40 citations in Web of Science

72 citations in Scopus

Details

Title: Multimodal Emotion Recognition Fusion Analysis Adapting BERT With Heterogeneous Feature Unification
Creators: Sanghyun Lee - School of Electrical Engineering, Korea University, Seoul, South Korea; David K Han - Drexel University; Hanseok Ko - Korea University
Publication Details: IEEE Access, vol. 9, pp. 94557-94572
Publisher: IEEE
Grant note: -10073162 / Technology Innovation Program (or Industrial Strategic Technology Development Program, Development of Robot’s Natural Language Recognition and Emotional Dialogue Technology) through the Ministry of Trade, Industry and Energy (MOTIE), South Korea (10.13039/501100003662)
Resource Type: Journal article
Language: English
Academic Unit: Electrical and Computer Engineering
Web of Science ID: WOS:000674231500001
Scopus ID: 2-s2.0-85112214088
Other Identifier: 991019168544904721

InCites Highlights

Data related to this publication, from InCites Benchmarking & Analytics tool:

Collaboration types: Domestic collaboration; International collaboration
Web of Science research areas: Computer Science, Information Systems; Engineering, Electrical & Electronic; Telecommunications
