Deep Learning for Music Classification
2pm
Room 5566 (Lifts 27-28), 5/F Academic Building, HKUST


Examination Committee

Prof Bertram SHI, ECE/HKUST (Chairperson)
Prof Pascale FUNG, ECE/HKUST (Thesis Supervisor)
Prof Michael WANG, ECE/HKUST

Abstract

Automatic music recommendation requires music information retrieval tasks ranging from genre and mood classification to artist relatedness. In this thesis, we propose using deep learning models for music classification tasks and provide both the motivation for and an analysis of why deep learning models are suitable for these tasks.

Like many tasks that require supervised learning, music information retrieval has traditionally relied on an elaborate feature engineering procedure that depends heavily on human labor. Much previous research has focused on discovering the most "salient" features for music, a quest that has proven challenging. In this thesis, we investigate neural networks to explore their ability to engineer features automatically, without human intervention. Convolutional Neural Networks (CNNs) have been shown to classify images directly from pixels in image recognition. The "neurons" in a CNN can be driven by the raw input data to adapt the final model to a desirable state in a supervised learning paradigm.


In this thesis, we show for the first time that deep learning models can learn directly from raw time-domain audio data, bypassing both the signal processing and feature engineering steps in music classification tasks. We carry out experiments on both audio clips and lyrics. We compare results from models trained on raw data with those from feature-engineered data, finding slightly better results with raw audio input. Moreover, we give a full analysis of the output of each convolutional layer, showing that the CNN is indeed performing signal processing and feature engineering automatically. We also show that a CNN with word embeddings can classify lyrics directly from words. Our results on using a CNN to classify music from raw time-domain audio data are later applied to a speech emotion recognition task, demonstrating the approach's generality.
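To make the raw-audio approach concrete, below is a minimal sketch of a 1D convolutional network that classifies waveform samples directly, written in PyTorch. This is an illustration of the general technique, not the architecture from the thesis: the class name RawAudioCNN, all layer sizes, the number of classes, and the 22,050 Hz sampling rate are assumptions made for the example.

    import torch
    import torch.nn as nn

    class RawAudioCNN(nn.Module):
        """Minimal 1D CNN over fixed-length raw waveforms (illustrative only).

        The first strided convolution acts as a learned filterbank,
        standing in for the hand-crafted signal processing step.
        """

        def __init__(self, num_classes: int = 10):
            super().__init__()
            self.features = nn.Sequential(
                # Learned "filterbank" over raw samples (hypothetical sizes).
                nn.Conv1d(1, 32, kernel_size=256, stride=128),
                nn.ReLU(),
                nn.Conv1d(32, 64, kernel_size=8, stride=4),
                nn.ReLU(),
                nn.AdaptiveAvgPool1d(1),  # collapse the time axis
            )
            self.classifier = nn.Linear(64, num_classes)

        def forward(self, waveform: torch.Tensor) -> torch.Tensor:
            # waveform: (batch, 1, num_samples), e.g. a few seconds of audio
            x = self.features(waveform)
            return self.classifier(x.squeeze(-1))

    # Example: classify a batch of 3-second clips at an assumed 22,050 Hz.
    model = RawAudioCNN(num_classes=10)
    clips = torch.randn(4, 1, 3 * 22050)  # dummy raw audio stand-in
    logits = model(clips)                 # (4, 10) class scores

Inspecting the intermediate activations of the early convolutional layers in a network like this is what allows one to argue, as the abstract describes, that the model is performing signal processing and feature engineering on its own.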

Speaker / Performer:
Mr Yan WAN
Language
English