Building a model to quantify image and music correlation for an Artificial
Intelligence based recommendation service
BUS 697E – Directed Study, Fall 2017
Vishal Changrani, EvMBA 2018
Faculty advisor: Prof. Tom Smith
11/14/2017
Contents
Executive Summary
Introduction
Existing research
    Music features
    Image features
Building a music classifier
    Data gathering
    Feature extraction
    Music selection for survey
    Analyzing survey result
    Results
Future research
References
Appendix A – Survey screenshots
    Welcome screen
    Check audio screen
    Sample Question screen
    After submission screen
Appendix B – Python notebooks

Figures
Figure 1 Overall method
Figure 2 Unsupervised hierarchical clustering
Figure 3 Classification method [Source: 10]
Figure 4 Confusion Matrix
Figure 5 Feature Importance
Executive Summary
➢ Existing research in the fields of image and music correlation, music mood classification, and
image impression helped identify a list of features that can be used to build a classifier.
➢ A classifier was built on these features to identify the mood of a given piece of music, using
existing data and new data collected through a survey.
➢ The classifier had an overall accuracy of 31% and a precision of 0.8 for music that it classified
as ‘sad’.
Introduction
As sentient beings, our consciousness is supported by all five of our senses working together,
creating a holistic impression of our world in which the whole is greater than the sum of its parts.
Hence it is not surprising that the emotional impact of an image on our mind is amplified when the
image is combined with music, or that the impact of the written word is heightened when it is
overlaid on an image. If this seemingly subjective change in perception caused by mixing different
media can be quantified with reasonable accuracy by a predictive model, then that model can be
used in existing media and entertainment related products and in advertisements. A
recommendation service can also be built on top of this model and monetized under different
business models.
There has already been a lot of research in the field of perceptual psychology, advertising and
information technology to quantify this interaction between the visual and the auditory sensory
modalities. This report delineates some of this research. It also summarizes an attempt to create a
classifier for music which predicts the impression of the music on the listener and finally lists areas of
future research that may be pursued.
Existing research
Both images and music elicit an emotional response from us. These human emotions can be
classified using simple labels: sad, happy, angry, bright, dull, and so on. Reference [1] provides a
great starting point on how the interaction between music and images and their emotional
response may be quantified using some of the physical features of each medium. For images it
uses features such as RGB values, HSI values, and transverse lines, and for music it uses features
such as volume, pitch, and timbre. It demonstrates how, by conducting simple experiments, a
model can be built that predicts the effect of music on the emotional impression of an image. It
concludes that the color information of the images considered was strongly correlated with
adjectives expressing “potency and activity,” and that the entropy of saturation was correlated
with words expressing spatial extent. Similarly, the physical properties representing the power of
the music were related to impression words expressing “potency and activity”.
A presentation in which the audio and visual elements complement each other and enhance the
overall impact is said to have achieved ‘consonance’. For example, when a somber piece of music
is played with a somber image, the image appears even more dull and gloomy. Similarly, when a
peppy or happy piece of music is played with a happy image, e.g. an image of a holiday spot, the
image may appear even more pleasing. Hence, research that identifies the mood of a piece of
music and research that identifies the impression of an image can be used in tandem to find
music and images which will produce consonance.
Using the existing research, a simple list of features for music and features for images was created.
Music features
| Feature | Description | Reference |
| --- | --- | --- |
| Average tempo as bpm (beats per minute) | The frequency with which a human would tap their foot while listening to the piece of music. | [4] |
| Zero crossings | Time-domain zero crossings can be used to measure how noisy the signal is and correlate somewhat with high-frequency content. Since all songs had the same duration, an absolute count was used instead of a rate. | [4] |
| Spectral centroid | A measure of the “brightness” of a sound; relates to musical timbre. | [5] & [6] |
| Average bandwidth | An indicator of the spectral range of the interesting parts of the signal, i.e. the parts around the centroid. The average bandwidth of a music piece may serve to describe its perceived timbre. | [5] & [6] |
| MFCC_x and MFCC_SD_x (mel-frequency cepstral coefficients and their standard deviations) | The MFCCs of a signal are a small set of features (usually about 10–20) which concisely describe the overall shape of the spectral envelope, a measure of the timbre of a piece of music. Twelve MFCC coefficients were derived. | [7] |
| Chroma_x and Chroma_SD_x (average CENS for each of the 12 semitones and the corresponding standard deviations) | A chroma vector is typically a 12-element feature vector indicating how much energy of each pitch class {C, C#, D, D#, E, ..., B} is present in the signal; it is used for identifying similarity between two sounds. The chroma energy normalized statistics (CENS) vector smooths chroma over local deviations in tempo, articulation, and musical ornaments such as trills and arpeggiated chords. The feature used here is the average CENS value for each of the 12 pitch classes. | [7] & [8] |
Image features
| Feature | Description | Reference |
| --- | --- | --- |
| Mean hue | HSI and HSV scales are closer to how humans perceive color. Mean across all pixels in the image. | [1] |
| Mean saturation | Mean across all pixels in the image. | [1] |
| Mean intensity | Mean across all pixels in the image. | [1] |
| Mean value of red | Mean across all pixels in the image. | [1] |
| Mean value of green | Mean across all pixels in the image. | [1] |
| Mean value of blue | Mean across all pixels in the image. | [1] |
| Average RGB entropy | Average entropy can be considered a proxy for how interesting the image is: the greater the entropy, the more interesting the image. | [11] |
| Direction (Gabor filter) | A Gabor filter makes it possible to see whether the image is marked by straight lines or transverse lines. | [1] |
| Dominant color in RGB | The one color that is most prevalent in the image in the RGB space. | [1] |
| Dominant color in HSV | The one color that is most prevalent in the image in the HSV space. | [1] |
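As a rough illustration of how such image features might be computed, here is a minimal sketch using Pillow and NumPy. The function name, the 32-level color quantization, and the file handling are illustrative assumptions, not the report's actual implementation (the report's notebooks are linked in Appendix B).

```python
# Sketch: computing some of the image features above with Pillow and NumPy.
import numpy as np
from PIL import Image

def image_features(path):
    img = Image.open(path).convert("RGB")
    rgb = np.asarray(img, dtype=np.float64)                  # H x W x 3
    hsv = np.asarray(img.convert("HSV"), dtype=np.float64)   # H x W x 3

    # Mean R, G, B and mean H, S, V across all pixels [1]
    mean_rgb = rgb.reshape(-1, 3).mean(axis=0)
    mean_hsv = hsv.reshape(-1, 3).mean(axis=0)

    # Average RGB entropy: Shannon entropy of each channel's histogram,
    # averaged over the three channels; a proxy for how "interesting"
    # the image is [11].
    def channel_entropy(channel):
        counts, _ = np.histogram(channel, bins=256, range=(0, 256))
        p = counts / counts.sum()
        p = p[p > 0]
        return float(-(p * np.log2(p)).sum())

    avg_entropy = np.mean([channel_entropy(rgb[..., c]) for c in range(3)])

    # Dominant RGB color: most frequent value after a coarse quantization
    # (quantization step is an assumption made for illustration)
    quantized = (rgb.reshape(-1, 3) // 32).astype(int)
    colors, counts = np.unique(quantized, axis=0, return_counts=True)
    dominant_rgb = colors[counts.argmax()] * 32

    return mean_rgb, mean_hsv, avg_entropy, dominant_rgb
```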
Building a music classifier
This section describes how a music classifier that labels a piece of music as happy, neutral, or sad
was built. The following diagram shows the approach that was followed.
Figure 1 Overall method
Data gathering
Choice of music
There is always a bias associated with the music that we hear. These biases may stem from the
memories the music evokes or from inherent perceptions of the artist, the lyrics, or the genre.
Moreover, popular genres such as hip hop, jazz, pop, and rock have a very complex musical
structure, and features derived from one such piece are not easily comparable to those of another.
Hence, I decided to use old classical piano music, under the assumption that it would carry less
bias and that features derived from the pieces would be comparable to each other, since only a
single instrument, the piano, was used to produce them.
The classical piano music was obtained from [2] in the 44.1 kHz, 128 kbit/s mp3 format. All the
mp3s were trimmed to retain only the first 30 seconds, similar to the approach taken in [1], which
notes that we form first impressions of an object within just a few seconds; trimming also kept the
survey short, to elicit more responses.
There were a total of 61 mp3s from 12 different composers such as Bach, Beethoven, and Chopin.
These pieces are movements from the composers' larger works that have been rendered in a piano
format; more on this process is described in [3].
Feature extraction
All the features listed earlier in the Music features section were extracted from the music using
the Python librosa library; a minimal sketch is shown below. Links to the Python code that was
used are available in Appendix B – Python notebooks.
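The following is a hedged sketch of what that extraction might look like with librosa; the file name and the aggregation choices are assumptions, and the actual notebook in Appendix B is authoritative. Aggregated this way, the features add up to the 52 columns (p = 52) mentioned in the survey analysis section.

```python
# Sketch: per-file feature extraction with librosa (file name illustrative).
import numpy as np
import librosa

# Load only the first 30 seconds, mirroring the trimming described above.
y, sr = librosa.load("alb_se1.mp3", duration=30.0)

tempo, _ = librosa.beat.beat_track(y=y, sr=sr)      # average tempo in bpm
tempo = float(np.atleast_1d(tempo)[0])              # scalar across librosa versions

zero_crossings = int(librosa.zero_crossings(y).sum())   # absolute count
centroid = float(librosa.feature.spectral_centroid(y=y, sr=sr).mean())
bandwidth = float(librosa.feature.spectral_bandwidth(y=y, sr=sr).mean())

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=12)   # 12 MFCC coefficients
mfcc_mean, mfcc_sd = mfcc.mean(axis=1), mfcc.std(axis=1)

cens = librosa.feature.chroma_cens(y=y, sr=sr)       # CENS, 12 pitch classes
chroma_mean, chroma_sd = cens.mean(axis=1), cens.std(axis=1)

# 4 scalars + 24 MFCC + 24 chroma values = 52 features per piece
features = np.hstack([[tempo, zero_crossings, centroid, bandwidth],
                      mfcc_mean, mfcc_sd, chroma_mean, chroma_sd])
```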
Music selection for survey
Ideally, each survey participant would have rated each of the 61 pieces. However, that would have
taken more than half an hour per survey. Since the survey was completely voluntary and no
incentive was provided, it was shortened so that each survey included only 10 music files, in the
hope of eliciting more responses. The music pieces were presented in a random order in each
survey to remove any relative bias between them.
Additionally, each survey was designed to be representative of the complete music data set by
identifying clusters of similar music. This was done using unsupervised hierarchical clustering to
create clusters of music files that were similar to each other with respect to the extracted
features. A cluster count of 4 was chosen based on the following dendrogram, using a cutoff
distance of 500000. Then, three files from cluster 1, three files from cluster 3, and four files from
cluster 4 were randomly chosen for each run of the survey, giving ten pieces of music that
represented the complete dataset in terms of the features under consideration. (Cluster 2
contained only one file and hence was skipped altogether.) A sketch of this clustering and
sampling step appears after the figure. Appendix A shows the screenshots of the survey.
Figure 2 Unsupervised hierarchical clustering
The final list of mp3s was: beethoven_hammerklavier_3, islamei, waldstein_3, alb_se1,
beethoven_les_adieux_1, brahms_opus1_2, mond_3, alb_esp1, bach_847, br_im6.
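A minimal sketch of the clustering and sampling described above, using SciPy and the standard library. The linkage method (Ward) and the feature-matrix file name are assumptions; the actual notebook is linked in Appendix B.

```python
# Sketch: hierarchical clustering and survey sampling with SciPy.
import random
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

X = np.load("music_features.npy")     # hypothetical: 61 x 52 feature matrix

Z = linkage(X, method="ward")         # agglomerative clustering (method assumed)
dendrogram(Z)                         # inspect to pick the cutoff (Figure 2)
labels = fcluster(Z, t=500000, criterion="distance")   # 4 clusters at 500000

clusters = {c: np.where(labels == c)[0].tolist() for c in set(labels)}

# Three files from cluster 1, three from cluster 3, four from cluster 4;
# cluster 2 held a single file and was skipped.
survey = (random.sample(clusters[1], 3)
          + random.sample(clusters[3], 3)
          + random.sample(clusters[4], 4))
random.shuffle(survey)                # present the pieces in random order
```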
Analyzing survey result
The survey was run for a period of three weeks, and 28 participants completed it. The small
number of observations (n = 61) combined with the large number of columns (p = 52) resulted in
the classic ‘small n, large p’ problem. Hence, although the survey asked participants to rate each
piece of music on a bipolar scale with five choices (‘very sad’, ‘sad’, ‘neutral’, ‘happy’, and ‘very
happy’), the results were compressed to a scale of only three choices (‘sad’, ‘neutral’, ‘happy’) by
changing the ‘very sad’ label to ‘sad’ and the ‘very happy’ label to ‘happy’, as in the sketch below.
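A small sketch of that relabeling step with pandas; the file and column names are hypothetical.

```python
# Sketch: compressing the five-point scale to three labels with pandas.
import pandas as pd

responses = pd.read_csv("survey_results.csv")        # hypothetical file
responses["rating"] = responses["rating"].replace(
    {"very sad": "sad", "very happy": "happy"})
# Remaining labels: 'sad', 'neutral', 'happy'
```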
Choice of classifier
The problem at hand is a supervised classification problem. The Random Forest Classifier [9] was
chosen for the following reasons:
1. Since the predictive power of each individual feature was not known upfront, a random forest
would surface the feature importances.
2. It is an ensemble method and hence tends to be more accurate than a single decision tree.
3. The relation between the features and the class label could not be assumed to be linear, so
logistic regression would not be suitable.
Train-test split
The results were split into a training set (80%) and a test set (20%). The training set was used to
build the model; once built, the model was used to predict the class labels for the test set, and
additionally to predict the class labels of the training set. A sketch of this step is shown below.
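A minimal sketch of the split, the random forest, and the evaluation, using scikit-learn [9]; the file names, the random seed, and the use of default hyperparameters are assumptions.

```python
# Sketch: train/test split, random forest, and evaluation with scikit-learn.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

X = np.load("music_features.npy")                   # 61 x 52 feature matrix
y = np.load("music_labels.npy", allow_pickle=True)  # 'sad'/'neutral'/'happy'

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

clf = RandomForestClassifier(random_state=42)       # defaults assumed
clf.fit(X_train, y_train)

y_pred = clf.predict(X_test)
print(confusion_matrix(y_test, y_pred))       # data behind Figure 4's heat map
print(classification_report(y_test, y_pred))  # per-class precision and recall

# Feature importances underlying Figure 5; the top three drive the
# ranking reported in the Results section.
top3 = np.argsort(clf.feature_importances_)[::-1][:3]
print(top3, clf.feature_importances_[top3])
```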
Figure 3 Classification method [Source: 10]
Results
• The overall model accuracy was 31.2%.
• The model failed entirely to classify happy and neutral pieces of music.
• It was, however, very good at identifying sad pieces of music.
• Recall for ‘happy’ was 0, for ‘neutral’ 0.16, and for ‘sad’ 0.4.
• Precision for ‘happy’ was 0, for ‘neutral’ 0.25, and for ‘sad’ 0.8.
• The following heat map of the confusion matrix summarizes this result.
Figure 4 Confusion Matrix
• The model identified the following three as the most important features:
1. Average Spectral Centroid
2. MFCC 2
3. Beats per minute
• The following graph summarizes the feature importance as identified by the classifier.
Figure 5 Feature Importance
• As an alternative approach, the MFCC coefficients were dropped from the feature set, but that
substantially reduced the accuracy of the classifier.
Future research
The following areas of future research will be pursued to gain more insight into the correlation of
music and images:
1. Create an image mood classifier like the one created for music.
2. Create a classifier which considers features of both image and music.
3. Create a simple recommendation engine which uses these classifiers and additionally applies a
form of collective intelligence by continuously recording responses.
References
1. Sato, K. and Mitsukura, Y. (2013), Effects of Music on Image Impression and Relationship
between Impression and Physical Properties. Electron. Comm. Jpn., 96: 53–61.
doi:10.1002/ecj.11371
2. http://www.piano-midi.de/
3. http://www.piano-midi.de/technic.htm
4. Tao Li, Mitsunori Ogihara, George Tzanetakis (eds.), Music Data Mining.
5. Bojiong Ni, David Wugofski, Zhiming Shi (2016), Video game genre classification using video
game music, Stanford University
(http://cs229.stanford.edu/proj2016/report/NiShiWugofski_FinalReport.pdf)
6. Knees, Peter and Schedl, Markus (2016), Music Similarity and Retrieval - An Introduction to
Audio- and Web-based Strategies.
7. http://musicinformationretrieval.com
8. Vivek Jayaram, Samarth Singal and Saroj Kandel (2015), Auto DJ mixing
(https://github.com/vivjay30/AutoDJ)
9. http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html
10. https://www.slideshare.net/pierluca.lanzi/machine-learning-and-data-mining-14-evaluation-and-credibility
11. http://www.astro.cornell.edu/research/projects/compression/entropy.html
Appendix A – Survey screenshots
Welcome screen
Check audio screen
Sample Question screen
After submission screen
Appendix B – Python notebooks
https://github.com/vishalchangrani/thougtstream
Feature extraction notebook:
https://github.com/vishalchangrani/thougtstream/blob/master/musicfeatures.ipynb
Unsupervised Hierarchical clustering notebook:
https://github.com/vishalchangrani/thougtstream/blob/master/MusicClustering-hierarchical.ipynb
Unsupervised K-means clustering notebook:
https://github.com/vishalchangrani/thougtstream/blob/master/MusicClustering-kmeans.ipynb
Survey result analysis notebook:
https://github.com/vishalchangrani/thougtstream/blob/master/SurveyResultAnalysis.ipynb