This is a recommender system. Seen from the outside, it is firmly stuck between collaborative filtering and content-based filtering. Recommender systems have been in use for a long time, but the recommendations are still not perfect. Usually the problems are the choice of technologies or frameworks... In our case they are the cold-start problem, the semantic gap, and more!
AI&BigData Lab 2016. Igor Kostiuk: How to train a music recommender system
1. Igor Kostiuk | 2016
Tags: #music, #recommender_systems, #deep_learning, #neural_networks, #mel_spectrograms
How to train your music recommender system
3. Collaborative filtering
- cold-start problem (requires a large amount of information about a user in order to make accurate recommendations)
- will not recommend rare or new songs, games, etc. (popular items are much easier to recommend than unpopular ones)
- poor scalability
+ content-agnostic
Example: Last.fm recommends music based on a comparison of the listening habits of similar users.
5. Content-based filtering
- can only make recommendations that are similar to the original seed
- semantic gap between the audio or video and the various aspects of a song or movie that affect user preferences (genre, mood)
- obvious recommendations (Doom → Doom 4, etc.)
http://static.giantbomb.com/uploads/original/13/137381/2846580-doom.jpg
6. There is nothing more similar to a tea kettle than another tea kettle.
7. Approaches
1. Automatic generation of social tags
Social tags are user-generated keywords associated with a song. Predicting these social tags directly from MP3 files avoids the "cold-start problem". Using a set of one-vs-all classifiers, one per tag, we can map audio features onto social tags collected from the Web (see the sketch after this slide).
2. Music genre classification
Attempt to classify songs into a set of genre classes. Clustering: each cluster represents a specific genre. A label is assigned to each cluster by a "majority vote", i.e. the genre that is most common in that cluster (also shown in the sketch below).
https://en.wikipedia.org/wiki/Mel-frequency_cepstrum
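A minimal scikit-learn sketch of both approaches, not from the talk: the random feature matrix, the tag vocabulary, and the genre labels are placeholders standing in for real audio features (e.g. MFCC statistics), Web-collected tags, and a labeled subset.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer

rng = np.random.default_rng(0)
X = rng.random((1000, 20))  # placeholder audio features, one row per track

# Approach 1: one one-vs-all classifier per social tag.
vocab = ["rock", "jazz", "loud", "mellow", "electronic"]  # placeholder tags
tags = [list(rng.choice(vocab, size=2, replace=False)) for _ in range(1000)]
Y = MultiLabelBinarizer(classes=vocab).fit_transform(tags)
tagger = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, Y)

# Approach 2: cluster tracks, then label each cluster by majority vote.
genres = rng.integers(0, 5, size=1000)  # placeholder known genres
clusters = KMeans(n_clusters=5, n_init=10).fit_predict(X)
cluster_genre = {c: np.bincount(genres[clusters == c]).argmax()
                 for c in range(5)}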
8. Deep Learning approach
Predicting listening preferences from audio signals by training a
regression model to predict the latent representations of songs
that were obtained from a collaborative filtering model.
[Pipeline diagram: data from a collaborative filtering model → matrix factorization → latent factor vector extraction (the regression targets); raw MP3 data → mel-spectrogram extraction → input to a deep neural network whose output is the predicted latent factor vector.]
10. Development stages
Data retrieval
The Echo Nest Taste Profile Subset
http://labrosa.ee.columbia.edu/millionsong/tasteprofile
b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBSUJE12A6D4F8CF5 2
b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBVFZR12A6D4F8AE3 1
b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBXALG12A8C13C108 1
b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBXHDL12A81C204C0 1
b80344d063b5ccb3212f76538f3d9e43d87dca9e SOBYHAJ12A6701BF1D 1
b80344d063b5ccb3212f76538f3d9e43d87dca9e SOCNMUH12A6D4F6E6D 1
b80344d063b5ccb3212f76538f3d9e43d87dca9e SODACBL12A8C13C273 1
b80344d063b5ccb3212f76538f3d9e43d87dca9e SODDNQT12A6D4F5F7E 5
The Taste Profile subset is big. Some numbers:
1,019,318 unique users
384,546 unique MSD songs
48,373,586 user - song - play count triplets
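As a sketch, the tab-separated triplets shown above can be loaded into a sparse user-song play-count matrix; the file name follows the dataset's distribution but is an assumption here.

import scipy.sparse as sp

users, songs = {}, {}
rows, cols, plays = [], [], []
with open("train_triplets.txt") as f:  # the Taste Profile triplets file
    for line in f:
        user, song, count = line.split("\t")
        rows.append(users.setdefault(user, len(users)))
        cols.append(songs.setdefault(song, len(songs)))
        plays.append(int(count))

# m users x n songs play-count matrix R
R = sp.csr_matrix((plays, (rows, cols)), dtype="float32")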
11. Data retrieval
https://www.7digital.com/
We were able to obtain 29-second audio clips for over 99% of the dataset.
The original dataset has no raw audio, only precomputed, poorly documented features.
13. Weighted matrix factorization
R ≈ P * Q
R – rating matrix (m users × n songs)
P – user matrix (m × f)
Q – song matrix (f × n)
f – number of latent features
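A minimal numpy sketch of weighted matrix factorization via alternating least squares. The confidence weighting c = 1 + alpha*r follows Hu, Koren & Volinsky's implicit-feedback formulation, which is an assumption here (the talk only names weighted matrix factorization), and this dense toy version is for illustration; real implementations exploit sparsity.

import numpy as np

def wmf_als(R, f=40, alpha=40.0, reg=0.1, iters=10):
    # R: m x n play-count matrix; returns P (m x f) and Q (f x n).
    m, n = R.shape
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.01, size=(m, f))
    Q = rng.normal(scale=0.01, size=(f, n))
    C = 1.0 + alpha * R            # confidence weights
    pref = (R > 0).astype(float)   # binarized preferences
    I = reg * np.eye(f)
    for _ in range(iters):
        for u in range(m):         # re-solve each user vector
            W = np.diag(C[u])
            P[u] = np.linalg.solve(Q @ W @ Q.T + I, Q @ W @ pref[u])
        for i in range(n):         # re-solve each song vector
            W = np.diag(C[:, i])
            Q[:, i] = np.linalg.solve(P.T @ W @ P + I, P.T @ W @ pref[:, i])
    return P, Q

The f-dimensional columns of Q are the per-song latent factor vectors that the neural network is later trained to predict.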
18. Mel-spectrograms
A mel-spectrogram is a kind of time-frequency representation. It is obtained from an audio signal by computing the Fourier transforms of short, overlapping windows. The frequency axis is then changed from a linear scale to the mel scale.
https://en.wikipedia.org/wiki/Mel_scale
22. Mel-spectrograms
We used log-compressed mel-spectrograms with 128 components, a window size of 1024 and a hop size of 512 audio frames.
https://github.com/librosa/librosa
http://librosa.github.io/librosa/generated/librosa.feature.melspectrogram.html#librosa.feature.melspectrogram
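A sketch with librosa matching those parameters; the clip path is a placeholder, and power_to_db is used as one common form of log compression.

import librosa

y, sr = librosa.load("clip.mp3", sr=22050)  # one 29-second audio clip
S = librosa.feature.melspectrogram(y=y, sr=sr,
                                   n_fft=1024,      # window size
                                   hop_length=512,  # hop size
                                   n_mels=128)      # 128 mel components
log_S = librosa.power_to_db(S)                      # log compression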
24. Convolutional neural network
The baseline deep neural network architecture consists of two convolutional layers and two fully connected layers.
http://benanne.github.io/2014/08/05/spotify-cnns.html
26. Convolutional neural network
The network can be trained on 3-second windows sampled randomly from the audio clips.
The last layer of the network is the output layer, which predicts the 40 latent factors obtained from the collaborative filtering model.
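A minimal PyTorch sketch of such a network; the layer widths and kernel sizes are assumptions, and, following the referenced blog post, the convolution runs over time with the 128 mel bins as input channels. At a 22,050 Hz sample rate and a hop size of 512, a 3-second window is roughly 130 spectrogram frames.

import torch
import torch.nn as nn

class LatentFactorCNN(nn.Module):
    # Input: (batch, 128 mel bins, ~130 frames) of log-mel spectrogram.
    def __init__(self, n_factors=40):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(128, 256, kernel_size=4), nn.ReLU(),
            nn.MaxPool1d(4),
            nn.Conv1d(256, 256, kernel_size=4), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),  # global max pooling over time
        )
        self.fc = nn.Sequential(
            nn.Flatten(),
            nn.Linear(256, 512), nn.ReLU(),
            nn.Linear(512, n_factors),  # regresses the 40 latent factors
        )

    def forward(self, x):
        return self.fc(self.conv(x))

model = LatentFactorCNN()
windows = torch.randn(8, 128, 130)  # a batch of 3-second windows
targets = torch.randn(8, 40)        # latent factors from the WMF step
loss = nn.functional.mse_loss(model(windows), targets)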