Building a model to quantify image and music correlation for an Artificial
Intelligence based recommendation service
BUS 697E – Directed Study, Fall 2017
Vishal Changrani, EvMBA 2018
Faculty advisor: Prof. Tom Smith
11/14/2017
Contents
Executive Summary
Introduction
Existing research
    Music features
    Image features
Building a music classifier
    Data gathering
    Feature extraction
    Music selection for survey
    Analyzing survey result
    Results
Future research
References
Appendix A – Survey screenshots
    Welcome screen
    Check audio screen
    Sample Question screen
    After submission screen
Appendix B – Python notebooks

Figures
Figure 1 Overall method
Figure 2 Unsupervised hierarchical clustering
Figure 3 Classification method [Source: 10]
Figure 4 Confusion Matrix
Figure 5 Feature Importance
Executive Summary
➢ Existing research in the fields of image and music correlation, music mood classification, and
image impression helped identify a list of features that can be used to build a classifier.
➢ A classifier was built on these features to identify the mood of a given piece of music, using
existing data and new data collected through a survey.
➢ The classifier had an overall accuracy of 31% and a precision of 0.8 for music that it classified
as ‘sad’.
Introduction
As sentient beings, our consciousness is supported by all five of our senses working together,
creating a holistic impression of our world in which the whole is greater than the sum of its parts.
Hence it is not surprising that the emotional impact of an image on our mind is amplified when the
image is combined with music, or that the impact of the written word is heightened when it is
overlaid on an image. If this seemingly subjective change in perception caused by mixing different
media can be quantified with reasonable accuracy by a predictive model, then that model can be
used in existing media and entertainment related products and in advertisements. A
recommendation service can also be built on top of this model and monetized under different
business models.
There has already been a lot of research in the field of perceptual psychology, advertising and
information technology to quantify this interaction between the visual and the auditory sensory
modalities. This report delineates some of this research. It also summarizes an attempt to create a
classifier for music which predicts the impression of the music on the listener and finally lists areas of
future research that may be pursued.
Existing research
Both images and music elicit an emotional response from us. These human emotions can be
classified using simple labels: sad, happy, angry, bright, dull, and so on. Reference [1] provides a
great starting point on how the interaction between music and images and their emotional
response may be quantified using some of the physical features of each medium. For images it
uses features such as RGB values, HSI values, and transverse lines, and for music it uses features
such as volume, pitch, and timbre. It demonstrates how, by conducting simple experiments, a
model can be built that predicts the effect of music on the emotional impression of an image. It
concludes that the color information of the images considered was strongly correlated with
adjectives expressing “potency and activity,” and that the entropy of saturation was correlated
with words expressing spatial extent. Similarly, the physical properties representing the power of
the music were related to impression words expressing “potency and activity”.
A presentation in which the audio and visual elements complement each other and enhance the
overall impact is said to have achieved ‘consonance’. For example, when a somber piece of music
is played with a somber image, the image appears even more dull and gloomy. Similarly, when a
peppy or happy piece of music is played with a happy image, e.g. an image of a holiday spot, the
image may appear even more pleasing. Hence, research that identifies the mood of a piece of
music and research that identifies the impression of an image can be used in tandem to find
music and images which will produce consonance.
Using the existing research, a simple list of features for music and features for images was created.
Music features
| Feature | Description | Reference |
| --- | --- | --- |
| Average tempo as bpm (beats per minute) | The frequency with which a human would tap their foot while listening to the piece of music. | [4] |
| Zero crossings | Time-domain zero crossings can be used to measure how noisy the signal is and correlate somewhat with high-frequency content. Since all songs had the same duration, an absolute count was used instead of a rate. | [4] |
| Spectral centroid | A measure of the “brightness” of a sound; relates to musical timbre. | [5] & [6] |
| Average bandwidth | An indicator of the spectral range of the interesting parts of the signal, i.e. the parts around the centroid. The average bandwidth of a music piece may serve to describe its perceived timbre. | [5] & [6] |
| MFCC_x and MFCC_SD_x (mel-frequency cepstral coefficients and their standard deviations) | The MFCCs of a signal are a small set of features (usually about 10–20) which concisely describe the overall shape of the spectral envelope, a measure of the timbre of a piece of music. Twelve MFCC coefficients were derived. | [7] |
| Chroma_x and Chroma_SD_x (average CENS for each of the 12 semitones and the corresponding standard deviations) | A chroma vector is typically a 12-element feature vector indicating how much energy of each pitch class {C, C#, D, D#, E, ..., B} is present in the signal; it is used for identifying similarity between two sounds. The chroma energy normalized statistics (CENS) vector smooths chroma over local deviations in tempo, articulation, and musical ornaments such as trills and arpeggiated chords. The feature used here is the average CENS value for each of the 12 pitch classes. | [7] & [8] |
Image features
| Feature | Description | Reference |
| --- | --- | --- |
| Mean hue | HSI and HSV scales are closer to how humans perceive color. Mean across all pixels in the image. | [1] |
| Mean saturation | Mean across all pixels in the image. | [1] |
| Mean intensity | Mean across all pixels in the image. | [1] |
| Mean value of red | Mean across all pixels in the image. | [1] |
| Mean value of green | Mean across all pixels in the image. | [1] |
| Mean value of blue | Mean across all pixels in the image. | [1] |
| Average RGB entropy | Average entropy can be considered a proxy for how interesting the image is: the greater the entropy, the more interesting the image. | [11] |
| Direction (Gabor filter) | A Gabor filter makes it possible to see whether the image is marked by straight lines or transverse lines. | [1] |
| Dominant color in RGB | The one color that is most prevalent in the image in the RGB space. | [1] |
| Dominant color in HSV | The one color that is most prevalent in the image in the HSV space. | [1] |
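As a rough illustration of how such image features might be computed, here is a minimal sketch using Pillow and NumPy. The function name, the 32-level color quantization, and the file handling are illustrative assumptions, not the report's actual implementation (the report's notebooks are linked in Appendix B).

```python
# Sketch: computing some of the image features above with Pillow and NumPy.
import numpy as np
from PIL import Image

def image_features(path):
    img = Image.open(path).convert("RGB")
    rgb = np.asarray(img, dtype=np.float64)                  # H x W x 3
    hsv = np.asarray(img.convert("HSV"), dtype=np.float64)   # H x W x 3

    # Mean R, G, B and mean H, S, V across all pixels [1]
    mean_rgb = rgb.reshape(-1, 3).mean(axis=0)
    mean_hsv = hsv.reshape(-1, 3).mean(axis=0)

    # Average RGB entropy: Shannon entropy of each channel's histogram,
    # averaged over the three channels; a proxy for how "interesting"
    # the image is [11].
    def channel_entropy(channel):
        counts, _ = np.histogram(channel, bins=256, range=(0, 256))
        p = counts / counts.sum()
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    avg_entropy = np.mean([channel_entropy(rgb[..., c]) for c in range(3)])

    # Dominant RGB color: most frequent value after a coarse quantization
    # (quantization step is an assumption made for illustration)
    quantized = (rgb.reshape(-1, 3) // 32).astype(int)
    colors, counts = np.unique(quantized, axis=0, return_counts=True)
    dominant_rgb = colors[counts.argmax()] * 32

    return mean_rgb, mean_hsv, avg_entropy, dominant_rgb
```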
Building a music classifier
This section describes how a music classifier that labels a piece of music as happy, neutral, or sad
was built. The following diagram shows the approach that was followed.
Figure 1 Overall method
Data gathering
Choice of music
There is always a bias associated with the music that we hear. These biases may stem from the
memories the music evokes or from inherent perceptions of the artist, the lyrics, or the genre.
Moreover, popular genres such as hip hop, jazz, pop, and rock have a very complex musical
structure, and features derived from one such piece are not easily comparable to those of another.
Hence, I decided to use old classical piano music, under the assumption that it would carry less
bias and that features derived from the pieces would be comparable to each other, since only a
single instrument, the piano, was used to produce them.
The classical piano music was obtained from [2] in the 44.1 kHz, 128 kbit/s mp3 format. All the
mp3s were trimmed to retain only the first 30 seconds, similar to the approach taken in [1], which
notes that we form first impressions of an object within just a few seconds; trimming also kept the
survey short, to elicit more responses.
There were a total of 61 mp3s from 12 different composers such as Bach, Beethoven, and Chopin.
These pieces are movements from the composers' larger works that have been rendered in a piano
format; more on this process is described in [3].
Feature extraction
All the features listed earlier in the Music features section were extracted from the music using
the Python librosa library; a minimal sketch is shown below. Links to the Python code that was
used are available in Appendix B – Python notebooks.
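The following is a hedged sketch of what that extraction might look like with librosa; the file name and the aggregation choices are assumptions, and the actual notebook in Appendix B is authoritative. Aggregated this way, the features add up to the 52 columns (p = 52) mentioned in the survey analysis section.

```python
# Sketch: per-file feature extraction with librosa (file name illustrative).
import numpy as np
import librosa

# Load only the first 30 seconds, mirroring the trimming described above.
y, sr = librosa.load("alb_se1.mp3", duration=30.0)

tempo, _ = librosa.beat.beat_track(y=y, sr=sr)      # average tempo in bpm
tempo = float(np.atleast_1d(tempo)[0])              # scalar across librosa versions

zero_crossings = int(librosa.zero_crossings(y).sum())   # absolute count
centroid = float(librosa.feature.spectral_centroid(y=y, sr=sr).mean())
bandwidth = float(librosa.feature.spectral_bandwidth(y=y, sr=sr).mean())

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)   # 12 MFCC coefficients
mfcc_mean, mfcc_sd = mfcc.mean(axis=1), mfcc.std(axis=1)

cens = librosa.feature.chroma_cens(y=y, sr=sr)       # CENS, 12 pitch classes
chroma_mean, chroma_sd = cens.mean(axis=1), cens.std(axis=1)

# 4 scalars + 24 MFCC + 24 chroma values = 52 features per piece
features = np.hstack([[tempo, zero_crossings, centroid, bandwidth],
                      mfcc_mean, mfcc_sd, chroma_mean, chroma_sd])
```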
Music selection for survey
Ideally, each survey participant would have rated each of the 61 pieces. However, that would have
taken more than half an hour per survey. Since the survey was completely voluntary and no
incentive was provided, it was shortened so that each survey included only 10 music files, in the
hope of eliciting more responses. The music pieces were presented in a random order in each
survey to remove any relative bias between them.
Additionally, each survey was designed to be representative of the complete music data set by
identifying clusters of similar music. This was done using unsupervised hierarchical clustering to
create clusters of music files that were similar to each other with respect to the extracted
features. A cluster count of 4 was chosen based on the following dendrogram, using a cutoff
distance of 500000. Then, three files from cluster 1, three files from cluster 3, and four files from
cluster 4 were randomly chosen for each run of the survey, giving ten pieces of music that
represented the complete dataset in terms of the features under consideration. (Cluster 2
contained only one file and hence was skipped altogether.) A sketch of this clustering and
sampling step appears after the figure. Appendix A shows the screenshots of the survey.
Figure 2 Unsupervised hierarchical clustering
The final list of mp3s was: beethoven_hammerklavier_3, islamei, waldstein_3, alb_se1,
beethoven_les_adieux_1, brahms_opus1_2, mond_3, alb_esp1, bach_847, br_im6.
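A minimal sketch of the clustering and sampling described above, using SciPy and the standard library. The linkage method (Ward) and the feature-matrix file name are assumptions; the actual notebook is linked in Appendix B.

```python
# Sketch: hierarchical clustering and survey sampling with SciPy.
import random
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

X = np.load("music_features.npy")     # hypothetical: 61 x 52 feature matrix

Z = linkage(X, method="ward")         # agglomerative clustering (method assumed)
dendrogram(Z)                         # inspect to pick the cutoff (Figure 2)
labels = fcluster(Z, t=500000, criterion="distance")   # 4 clusters at 500000

clusters = {c: np.where(labels == c)[0].tolist() for c in set(labels)}

# Three files from cluster 1, three from cluster 3, four from cluster 4;
# cluster 2 held a single file and was skipped.
survey = (random.sample(clusters[1], 3)
          + random.sample(clusters[3], 3)
          + random.sample(clusters[4], 4))
random.shuffle(survey)                # present the pieces in random order
```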
Analyzing survey result
The survey was run for a period of three weeks, and 28 participants completed it. The small
number of observations (n = 61) combined with the large number of columns (p = 52) resulted in
the classic ‘small n, large p’ problem. Hence, although the survey asked participants to rate each
piece of music on a bipolar scale with five choices (‘very sad’, ‘sad’, ‘neutral’, ‘happy’, and ‘very
happy’), the results were compressed to a scale of only three choices (‘sad’, ‘neutral’, ‘happy’) by
changing the ‘very sad’ label to ‘sad’ and the ‘very happy’ label to ‘happy’, as in the sketch below.
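A small sketch of that relabeling step with pandas; the file and column names are hypothetical.

```python
# Sketch: compressing the five-point scale to three labels with pandas.
import pandas as pd

responses = pd.read_csv("survey_results.csv")        # hypothetical file
responses["rating"] = responses["rating"].replace(
    {"very sad": "sad", "very happy": "happy"})
# Remaining labels: 'sad', 'neutral', 'happy'
```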
Choice of classifier
The problem at hand is a supervised classification problem. The Random Forest Classifier [9] was
chosen for the following reasons:
1. Since the predictive power of each individual feature was not known upfront, a random forest
would surface the feature importances.
2. It is an ensemble method and hence tends to be more accurate than a single decision tree.
3. The relation between the features and the class label could not be assumed to be linear, so
logistic regression would not be suitable.
Train-test split
The results were split into a training set (80%) and a test set (20%). The training set was used to
build the model; once built, the model was used to predict the class labels for the test set, and
additionally to predict the class labels of the training set. A sketch of this step is shown below.
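A minimal sketch of the split, the random forest, and the evaluation, using scikit-learn [9]; the file names, the random seed, and the use of default hyperparameters are assumptions.

```python
# Sketch: train/test split, random forest, and evaluation with scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

X = np.load("music_features.npy")                   # 61 x 52 feature matrix
y = np.load("music_labels.npy", allow_pickle=True)  # 'sad'/'neutral'/'happy'

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(random_state=42)       # defaults assumed
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))       # data behind Figure 4's heat map
print(classification_report(y_test, y_pred))  # per-class precision and recall

# Feature importances underlying Figure 5; the top three drive the
# ranking reported in the Results section.
top3 = np.argsort(clf.feature_importances_)[::-1][:3]
print(top3, clf.feature_importances_[top3])
```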
Figure 3 Classification method [Source: 10]
Results
• The overall model accuracy was 31.2%.
• The model failed entirely to classify happy and neutral pieces of music.
• It was, however, very good at identifying sad pieces of music.
• Recall for ‘happy’ was 0, for ‘neutral’ 0.16, and for ‘sad’ 0.4.
• Precision for ‘happy’ was 0, for ‘neutral’ 0.25, and for ‘sad’ 0.8.
• The following heat map of the confusion matrix summarizes this result.
Figure 4 Confusion Matrix
• The model identified the following three as the most important features:
1. Average Spectral Centroid
2. MFCC 2
3. Beats per minute
• The following graph summarizes the feature importance as identified by the classifier.
Figure 5 Feature Importance
• As an alternative approach, the MFCC coefficients were dropped from the feature set, but that
substantially reduced the accuracy of the classifier.
Future research
The following areas of future research will be pursued to gain more insight into the correlation of
music and images:
1. Create an image mood classifier like the one created for music.
2. Create a classifier which considers features of both image and music.
3. Create a simple recommendation engine which uses these classifiers and additionally applies a
form of collective intelligence by continuously recording responses.
References
1. Sato, K. and Mitsukura, Y. (2013), Effects of Music on Image Impression and Relationship
between Impression and Physical Properties. Electron. Comm. Jpn., 96: 53–61.
doi:10.1002/ecj.11371
2. http://www.piano-midi.de/
3. http://www.piano-midi.de/technic.htm
4. Tao Li, Mitsunori Ogihara, George Tzanetakis (eds.), Music Data Mining.
5. Bojiong Ni, David Wugofski, Zhiming Shi (2016), Video game genre classification using video
game music, Stanford University
(http://cs229.stanford.edu/proj2016/report/NiShiWugofski_FinalReport.pdf)
6. Knees, Peter and Schedl, Markus (2016), Music Similarity and Retrieval - An Introduction to
Audio- and Web-based Strategies.
7. http://musicinformationretrieval.com
8. Vivek Jayaram, Samarth Singal and Saroj Kandel (2015), Auto DJ mixing
(https://github.com/vivjay30/AutoDJ)
9. http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
10. https://www.slideshare.net/pierluca.lanzi/machine-learning-and-data-mining-14-evaluation-and-credibility
11. http://www.astro.cornell.edu/research/projects/compression/entropy.html
Appendix A – Survey screenshots
Welcome screen
Check audio screen
Sample Question screen
After submission screen
Appendix B – Python notebooks
https://github.com/vishalchangrani/thougtstream
Feature extraction notebook:
https://github.com/vishalchangrani/thougtstream/blob/master/musicfeatures.ipynb
Unsupervised Hierarchical clustering notebook:
https://github.com/vishalchangrani/thougtstream/blob/master/MusicClustering-hierarchical.ipynb
Unsupervised K-means clustering notebook:
https://github.com/vishalchangrani/thougtstream/blob/master/MusicClustering-kmeans.ipynb
Survey result analysis notebook:
https://github.com/vishalchangrani/thougtstream/blob/master/SurveyResultAnalysis.ipynb