Music-Video Emotion Analysis Using Late Fusion of Multimodal
Abstract
Music-video emotion is a high-level semantic representation of human internal feeling conveyed through sung lyrics, musical instrument performance, and visual expression. Online and offline music videos are rich sources for analyzing emotion with modern machine learning technologies. In this research we build a music-video emotion dataset and extract music and video features from pre-trained neural networks. Two pre-trained audio and video networks are first fine-tuned and then used to extract low-level and high-level features. The music network uses a 2D convolutional network, and the video network uses 3D convolution (C3D). We then concatenate the music and video features while preserving their time-varying structure. A long short-term memory (LSTM) network characterizes the long-term dynamic features, and various machine learning algorithms evaluate the emotions. We also apply late fusion to combine the learned features of the audio and video networks. The proposed network achieves improved performance on music-video emotion analysis.
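To illustrate the architecture described above, the following is a minimal sketch (not the authors' code) of late fusion for audio and video branches with an LSTM over time steps, written in PyTorch. All layer sizes, names, and the exact fusion point are illustrative assumptions; the paper's fine-tuned pre-trained networks are replaced here by small stand-in convolutional branches.

import torch
import torch.nn as nn

class LateFusionEmotionNet(nn.Module):
    """Sketch: per-step audio (2D CNN) and video (C3D-style 3D CNN) features,
    concatenated per time step (late fusion) and fed to an LSTM classifier."""
    def __init__(self, audio_feat_dim=128, video_feat_dim=256,
                 hidden_dim=128, num_emotions=6):
        super().__init__()
        # Audio branch: 2D convolution over a mel-spectrogram chunk (assumed input).
        self.audio_branch = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, audio_feat_dim),
        )
        # Video branch: 3D convolution (C3D-like) over a short RGB frame clip.
        self.video_branch = nn.Sequential(
            nn.Conv3d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(),
            nn.Linear(32, video_feat_dim),
        )
        # LSTM over the concatenated (fused) per-step features preserves time order.
        self.lstm = nn.LSTM(audio_feat_dim + video_feat_dim,
                            hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_emotions)

    def forward(self, audio_chunks, video_clips):
        # audio_chunks: (batch, T, 1, mels, frames); video_clips: (batch, T, 3, depth, H, W)
        B, T = audio_chunks.shape[:2]
        a = self.audio_branch(audio_chunks.flatten(0, 1)).view(B, T, -1)
        v = self.video_branch(video_clips.flatten(0, 1)).view(B, T, -1)
        fused = torch.cat([a, v], dim=-1)      # late fusion at each time step
        out, _ = self.lstm(fused)
        return self.classifier(out[:, -1])     # emotion logits from the last step

# Example usage with random tensors (shapes are illustrative):
model = LateFusionEmotionNet()
audio = torch.randn(2, 8, 1, 64, 32)       # 8 time steps of spectrogram chunks
video = torch.randn(2, 8, 3, 16, 64, 64)   # 8 clips of 16 RGB frames each
print(model(audio, video).shape)           # torch.Size([2, 6])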
Keywords
Music-video dataset, Multimedia emotion, Audio-video multimodal, Late fusion
DOI
10.12783/dtcse/iteee2019/28738