Music-Video Emotion Analysis Using Late Fusion of Multimodal

Yagya Raj PANDEYA, Joonwhoan LEE

Abstract


Music-video emotion is a high-level semantic of human feeling conveyed through sung lyrics, musical instrument performance, and visual expression. Online and offline music videos are a rich source for analyzing emotion with modern machine learning techniques. In this research, we build a music-video emotion dataset and extract music and video features from pre-trained neural networks. Two pre-trained networks, one for audio and one for video, are first fine-tuned and then used to extract low-level and high-level features. The music network uses 2D convolutions, while the video network uses 3D convolutions (C3D). We then concatenate the music and video features while preserving their time-varying structure. A long short-term memory (LSTM) network characterizes the long-term dynamics of these features, and several machine learning classifiers evaluate the emotions. We also apply late fusion to combine the learned features of the audio and video networks. The proposed network achieves better performance for music-video emotion analysis.
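The sketch below illustrates the kind of pipeline the abstract describes: a 2D-convolutional audio branch, a 3D-convolutional (C3D-style) video branch, time-preserving concatenation of their features, an LSTM over the joint sequence, and late fusion of per-branch predictions. It is a minimal PyTorch illustration; the layer sizes, tensor shapes, class count, and fusion-by-averaging choice are assumptions for demonstration, not the authors' exact architecture.

    # Minimal sketch of an audio/video late-fusion emotion classifier.
    # All layer sizes, shapes, and names are illustrative assumptions.
    import torch
    import torch.nn as nn


    class AudioBranch(nn.Module):
        """2D CNN over spectrogram chunks -> per-chunk audio features."""
        def __init__(self, feat_dim=128):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),            # -> (B*T, 64, 1, 1)
            )
            self.proj = nn.Linear(64, feat_dim)

        def forward(self, x):                        # x: (B, T, 1, mels, frames)
            b, t = x.shape[:2]
            x = self.conv(x.flatten(0, 1))
            return self.proj(x.flatten(1)).view(b, t, -1)   # (B, T, feat_dim)


    class VideoBranch(nn.Module):
        """3D CNN (C3D-style) over short clips -> per-clip visual features."""
        def __init__(self, feat_dim=128):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv3d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool3d(2),
                nn.Conv3d(32, 64, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool3d(1),            # -> (B*T, 64, 1, 1, 1)
            )
            self.proj = nn.Linear(64, feat_dim)

        def forward(self, x):                        # x: (B, T, 3, frames, H, W)
            b, t = x.shape[:2]
            x = self.conv(x.flatten(0, 1))
            return self.proj(x.flatten(1)).view(b, t, -1)   # (B, T, feat_dim)


    class LateFusionEmotionNet(nn.Module):
        """Concatenate time-aligned audio/video features, model dynamics with
        an LSTM, and late-fuse per-branch emotion predictions (logit average)."""
        def __init__(self, num_classes=6, feat_dim=128):
            super().__init__()
            self.audio, self.video = AudioBranch(feat_dim), VideoBranch(feat_dim)
            self.lstm = nn.LSTM(2 * feat_dim, 128, batch_first=True)
            self.fused_head = nn.Linear(128, num_classes)
            self.audio_head = nn.Linear(feat_dim, num_classes)
            self.video_head = nn.Linear(feat_dim, num_classes)

        def forward(self, audio, video):
            fa, fv = self.audio(audio), self.video(video)    # (B, T, feat_dim)
            _, (h, _) = self.lstm(torch.cat([fa, fv], dim=-1))
            logits = [
                self.fused_head(h[-1]),              # joint temporal representation
                self.audio_head(fa.mean(dim=1)),     # audio-only prediction
                self.video_head(fv.mean(dim=1)),     # video-only prediction
            ]
            return torch.stack(logits).mean(dim=0)   # late fusion: average logits


    if __name__ == "__main__":
        net = LateFusionEmotionNet()
        audio = torch.randn(2, 8, 1, 96, 64)         # 8 spectrogram chunks per video
        video = torch.randn(2, 8, 3, 16, 64, 64)     # 8 clips of 16 frames each
        print(net(audio, video).shape)               # torch.Size([2, 6])

In this hypothetical setup, replacing the small convolutional stacks with fine-tuned pre-trained backbones and swapping the final classifier for other machine learning algorithms would mirror the workflow described in the abstract.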

Keywords


Music-video dataset, Multimedia emotion, Audio-video multimodal, Late fusion


DOI
10.12783/dtcse/iteee2019/28738
