The document discusses the development of a music recommender system using the Beep Tunes dataset, outlining various approaches such as collaborative filtering and content-based filtering. It emphasizes the importance of user and item profiling, data analysis, and different model evaluation techniques including mean absolute error. The system's evaluation demonstrated significant improvements in recommendation accuracy by optimizing parameters and incorporating additional data features.
Building a Music Recommender System from Scratch on the Beep Tunes Dataset
Niloufar Farajpour, Mohamadreza Kiani, Mohamadreza Fereydooni, Tadeh Alexani
Rahnema College - Winter 2020
"As a data scientist, question the results, because often there is something you missed."
- Frank Kane
Collaborative Filtering (CF)

Memory Based: Find similar users based on cosine similarity or Pearson correlation and take a weighted average of their ratings.
- Advantage: Easy creation and explainability of results
- Disadvantage: Performance reduces when data is sparse, so it is not scalable

Model Based: Use machine learning to find user ratings of unrated items, e.g. PCA, SVD, Neural Nets, Matrix Factorization.
- Advantage: Dimensionality reduction deals with missing/sparse data
- Disadvantage: Inference is intractable because of hidden/latent factors
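The memory-based approach above can be sketched in a few lines. This is a minimal illustration on a hypothetical ratings matrix (not Beep Tunes data): find users who rated the target item, weight their ratings by cosine similarity to the target user, and take the weighted average.

```python
import numpy as np

# rows = users, columns = items; 0 means "not rated" (toy data, not Beep Tunes)
R = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
])

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def predict(R, user, item):
    """Similarity-weighted average of other users' ratings for `item`."""
    sims, rats = [], []
    for other in range(R.shape[0]):
        if other != user and R[other, item] > 0:
            sims.append(cosine(R[user], R[other]))
            rats.append(R[other, item])
    sims, rats = np.array(sims), np.array(rats)
    return float(sims @ rats / sims.sum())

print(predict(R, user=0, item=2))  # predicted rating for user 0 on item 2
```

Because the prediction loops over all users per item, the cost grows with the user base, which is exactly the scalability problem noted above.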
EDA
We considered two time parameters as well.
- Action-Publish:
  - If it is less than 10 days: 1.5
  - If it is more than 10 days: 1
- Action-Today:
  - 1 + (Action Year - 2011) / 10 + Action Month * 0.025
  - Example: 2016-5: 1 + (2016 - 2011)/10 + (5 * 0.025) = 1.625
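The two coefficients above translate directly into code. This is a sketch with function names of our own choosing:

```python
from datetime import date

def action_publish_coeff(action: date, publish: date) -> float:
    """Boost actions that happen within 10 days of the track's publish date."""
    return 1.5 if (action - publish).days < 10 else 1.0

def action_today_coeff(action: date) -> float:
    """1 + (action year - 2011) / 10 + action month * 0.025"""
    return 1 + (action.year - 2011) / 10 + action.month * 0.025

print(action_today_coeff(date(2016, 5, 1)))  # 1.625, matching the slide's example
```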
Evaluation: Outline of Online and Offline

[Diagram: two evaluation flows.]
- Online: the model produces a recommendation list (1- Track A, 2- Track B, 3- Track C); after the impression, the tracks the user actually interacts with are used to evaluate the list.
- Offline: the model is trained on a dataset for recommendation, produces a recommendation list, and that list is compared against held-out correct data.
Evaluation
- Split to train/test data using date (e.g. a year)
- From ~70,000,000 action records from 2011 to 2020:
  - Train data -> 2019 -> ~10,000,000 actions
  - Test data -> 2020 -> ~1,000,000 actions
Model: Model-Based CF
- Compute a correlation score for every column pair in the matrix
- This gives us a correlation score between every pair of tracks
- Problems:
  - Too long to compute
  - Sparseness
  - Scalability
Model: Matrix Factorization of User-Item Matrix

[Figure: a sparse user-item rating matrix (users A-D, items W-Z) factorized as User Matrix (4x2) x Item Matrix (2x4), e.g. user rows such as (1.2, 0.8) multiplied by item columns such as (1.5, 1.7).]
Model: Matrix Factorization of User-Item Matrix
- Latent factors are the features in the lower-dimensional latent space projected from the user-item interaction matrix.
- Matrix factorization is one of the most effective dimension-reduction techniques in machine learning.
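As a sketch of the idea, we can build a rank-2 user-item matrix from a 4x2 user-factor matrix and a 2x4 item-factor matrix (factor values taken from the slide's illustration), then recover factors of the same rank with a truncated SVD:

```python
import numpy as np

# factor values from the slide's illustration
user_true = np.array([[1.2, 0.8],
                      [1.4, 0.9],
                      [1.5, 1.0],
                      [1.2, 0.8]])
item_true = np.array([[1.5, 1.2, 1.0, 0.8],
                      [1.7, 0.6, 1.1, 0.4]])
R = user_true @ item_true            # 4x4 user-item matrix (rank 2)

U, s, Vt = np.linalg.svd(R, full_matrices=False)
k = 2                                # number of latent factors
user_factors = U[:, :k] * s[:k]      # 4x2 user matrix
item_factors = Vt[:k, :]             # 2x4 item matrix

# because R has rank 2, two latent factors reconstruct it exactly
assert np.allclose(user_factors @ item_factors, R)
```

On real data the matrix is sparse and the factorization only approximates the observed entries, but the unobserved entries of the reconstruction serve as predicted ratings.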
Model: ALS
Alternating Least Squares (ALS) is also a matrix factorization algorithm, and it runs in a parallel fashion.
Model: ALS
- Solves scalability and sparseness of the ratings data
- It's simple and scales well to very large datasets
Optimization: Evaluation Result
Old result:
- Using 2019 data as train (~10M)
- Using 2020 data as test (~1M)
- Total Users: 116,526
- Mean Score: 0.01%
- Max Score: 40%
New result:
- Using all data (~70M)
- Added date coefficients
- Found best parameters
- Total Users: 577,457
- Mean Score: 1.1%
- Max Score: 50%
~110x improvement in mean score based on the new data
Model: Content-Based Filtering
- Item profile: for each track we should construct a vector based on its features, like the tags and artists it has
- User profile: for each user we need a vector that shows their interests, based on ratings or likes and downloads
Model: Content-Based Filtering / User Profile
- User has rated items with profiles i1, i2, i3, ..., in
- One approach is a weighted average of the rated item profiles
Model: Content-Based Filtering / User Profile
- Items are songs; the only feature is "tag"
- Item profile: vector with 0 or 1 for each tag
- Suppose user x has downloaded or liked 5 songs:
  - 2 songs featuring TAG A
  - 3 songs featuring TAG B
- User profile = mean of item profiles:
  - Feature A's weight = 2/5 = 0.4
  - Feature B's weight = 3/5 = 0.6
Pros
- No need for data on OTHER users
- Able to recommend to users with unique tastes
- Able to recommend new & unpopular items
- No first-rater problem
- Explanations for recommended items
Cons
- Finding the appropriate features is hard
- Overspecialization
- Never recommends items outside the user's content profile