Building a Music Recommender System from Scratch on the Beep Tunes Dataset
Niloufar Farajpour, Mohamadreza Kiani, Mohamadreza Fereydooni, Tadeh Alexani
Rahnema College - Winter 2020
"As a data scientist, question the results, because often there is something you missed."
— Frank Kane
Dataset: BEEPTUNES.COM
Approaches
● Manual Curation by Experts
● Editorial Tagging
● Audio Signals
● Recommender Systems
Recommender Systems
● Content-Based
● Collaborative Filtering
  ○ Memory-Based
  ○ Model-Based
Collaborative Filtering (CF)

Memory-Based
● Find similar users based on cosine similarity or Pearson correlation and take a weighted average of their ratings.
● Advantage: easy to build, and the results are explainable.
● Disadvantage: performance drops when the data is sparse, so it does not scale.

Model-Based
● Use machine learning to predict user ratings of unrated items, e.g. PCA, SVD, neural nets, matrix factorization.
● Advantage: dimensionality reduction deals with missing/sparse data.
● Disadvantage: inference is intractable because of the hidden/latent factors.
User-based vs. Item-based
● User-based: "Users who are similar to you also liked …"
● Item-based: "Users who liked this item also liked …"
EDA
● Given dataset:
  ○ album_like.csv
  ○ album_track_purchase.csv
  ○ album_artist.info
  ○ download_album.csv
  ○ artist_like.csv
  ○ track_download.csv
  ○ track_like.csv
  ○ track_info.csv
  ○ track_tag.csv
  ○ track_artist.csv
EDA
[Venn diagram: overlap of downloaded tracks, purchased tracks, and liked tracks]
EDA: Action Collection

| USER_ID | TRACK_ID | C_DATE | PUBLISH_DATE | LIKED | DOWNLOAD | PURCHASE |
|---|---|---|---|---|---|---|
| 8681411 | 553605400.0 | 2019-11-19 11:11:29 | 2019-10-23 00:10:13 | 0.0 | 1.0 | 0.0 |
| 46170847 | 550442677.0 | 2019-06-19 09:06:27 | 2018-05-22 09:05:20 | 0.0 | 1.0 | 0.0 |
| 469026688 | 486718552.0 | 2018-06-29 23:06:07 | 2017-07-23 16:07:31 | 0.0 | 1.0 | 0.0 |
| 6644142 | 6058526.0 | 2014-03-14 20:03:04 | 2013-03-21 16:03:15 | 0.0 | 0.5 | 0.0 |
| 511266472 | 509711880.0 | 2018-09-19 20:09:36 | 2018-02-20 12:02:32 | 0.0 | 0.5 | 0.0 |
EDA
Impression = (Like × 4) + (Download × 2) + (Purchase × 1)
The weights are based on the inverse frequency of likes, downloads, and purchases among all actions (rarer actions get higher weight).
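The weighting above can be written as a small helper. This is a sketch rather than the project's actual code, and the function name is ours:

```python
# Impression score as defined on the slide; weights are the deck's
# inverse-frequency constants for each action type.
LIKE_W, DOWNLOAD_W, PURCHASE_W = 4, 2, 1

def impression(liked: float, download: float, purchase: float) -> float:
    """Combine the three action columns into a single implicit rating."""
    return liked * LIKE_W + download * DOWNLOAD_W + purchase * PURCHASE_W

# e.g. the fourth row of the table above: LIKED=0.0, DOWNLOAD=0.5, PURCHASE=0.0
score = impression(0.0, 0.5, 0.0)  # -> 1.0
```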
EDA

| USER_ID | TRACK_ID | C_DATE | PUBLISH_DATE | LIKED | DOWNLOAD | PURCHASE | OCCURED_AFTER |
|---|---|---|---|---|---|---|---|
| 8681411 | 553605400.0 | 2019-11-19 11:11:29 | 2019-10-23 00:10:13 | 0.0 | 1.0 | 0.0 | 27 days 11:01:16 |
| 46170847 | 550442677.0 | 2019-06-19 09:06:27 | 2018-05-22 09:05:20 | 0.0 | 1.0 | 0.0 | 393 days 00:01:07 |
| 469026688 | 486718552.0 | 2018-06-29 23:06:07 | 2017-07-23 16:07:31 | 0.0 | 1.0 | 0.0 | 341 days 06:58:36 |
| 6644142 | 6058526.0 | 2014-03-14 20:03:04 | 2013-03-21 16:03:15 | 0.0 | 0.5 | 0.0 | 358 days 03:59:49 |
| 511266472 | 509711880.0 | 2018-09-19 20:09:36 | 2018-02-20 12:02:32 | 0.0 | 0.5 | 0.0 | 211 days 08:07:04 |
EDA
We also considered two time parameters:
● Action-Publish (time between the track's publish date and the user's action):
  ○ Less than 10 days → 1.5
  ○ More than 10 days → 1
● Action-Today (recency of the action):
  1 + (Action Year − 2011) / 10 + Action Month × 0.025
  Example: 2016-05 → 1 + (2016 − 2011)/10 + (5 × 0.025) = 1.625
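The two coefficients are simple enough to sketch directly; function names here are ours, not the project's:

```python
def action_publish_coeff(days_between: float) -> float:
    """Boost actions that happened within 10 days of the track's publish date."""
    return 1.5 if days_between < 10 else 1.0

def action_today_coeff(year: int, month: int) -> float:
    """Recency coefficient: 1 + (year - 2011)/10 + month * 0.025."""
    return 1 + (year - 2011) / 10 + month * 0.025

# Worked example from the slide: May 2016 -> 1.625
coeff = action_today_coeff(2016, 5)
```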
EDA

| USER_ID | TRACK_ID | C_DATE | LIKED | DOWNLOAD | PURCHASE | Action-Today | Action-Publish |
|---|---|---|---|---|---|---|---|
| 11128390 | 6625384 | 2014-09-23 20:09:18 | 0 | 0.5 | 0 | 1.375 | 1 |
| 10481467 | 7014550 | 2014-08-01 12:08:21 | 0 | 0.5 | 0 | 1.35 | 1 |
| 38359414 | 2832153 | 2015-06-27 17:06:34 | 0 | 0 | 1 | 1.45 | 1 |
| 102312590 | 513326911 | 2018-05-30 19:05:49 | 0 | 0.5 | 0 | 1.75 | 1.5 |
| 12904873 | 2855686 | 2016-06-04 15:06:39 | 0 | 1 | 0 | 1.55 | 1 |
Evaluation: Outline of Online and Offline Evaluation
[Diagram: in both settings, the model produces a ranked recommendation list (1. Track A, 2. Track B, 3. Track C) from the recommendation dataset. Offline, the list is compared against held-out correct data; online, the list is evaluated directly with live users.]
Evaluation
● Split into train/test data by date (e.g. by year):
  ○ From 70,000,000 action records spanning 2011 to 2020:
    ■ Train data -> 2019 -> ~10,000,000 actions
    ■ Test data -> 2020 -> ~1,000,000 actions
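A time-based split like this is just a filter on the action timestamp. A minimal in-memory sketch (the project did this over 70M records in Spark; the sample rows here are hypothetical):

```python
# Each action row carries a C_DATE timestamp string like "2019-11-19 11:11:29".
actions = [
    {"USER_ID": 8681411, "C_DATE": "2019-11-19 11:11:29"},
    {"USER_ID": 46170847, "C_DATE": "2020-01-03 09:06:27"},
    {"USER_ID": 6644142, "C_DATE": "2019-03-14 20:03:04"},
]

# Train on 2019 actions, test on 2020 actions.
train = [a for a in actions if a["C_DATE"].startswith("2019")]
test = [a for a in actions if a["C_DATE"].startswith("2020")]
```

Splitting by time rather than at random matters here: it prevents the model from being evaluated on actions that happened before the ones it trained on.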
Evaluation

| User_ID | Track_ID | Impression | Duration (Normal) | Price (Normal) |
|---|---|---|---|---|
| 46170847 | 6034074 | 2.7 | 1.665000 | 1.160 |
| 469026688 | 6036881 | 1.3 | 0.331667 | 1.245 |

Impression is implicit feedback; Duration and Price are explicit features.
Evaluation
● Evaluate the model by computing the Mean Absolute Error (MAE) on the test data.

Evaluation: RegressionEvaluator (pyspark.ml.evaluation)
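The deck computes MAE with pyspark.ml.evaluation.RegressionEvaluator; as a plain-Python illustration of the same metric (the sample values are hypothetical):

```python
def mae(y_true, y_pred):
    """Mean Absolute Error: average |truth - prediction| over the test set."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# e.g. true impressions [2.7, 1.3] vs model predictions [2.2, 1.8]
error = mae([2.7, 1.3], [2.2, 1.8])  # -> 0.5
```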
Model: Memory-Based CF
● Compute a correlation score for every column pair of the user-item matrix.
● This gives us a correlation score between every pair of tracks.
● Problems: too long to compute, sparseness, and scalability.
Model: Model-Based CF with MLlib
Spark MLlib provides:
● Classification: logistic regression, linear SVM, naive Bayes
● Regression
● Clustering: k-means
● Collaborative filtering: alternating least squares (ALS)
Model: Matrix Factorization of the User-Item Matrix
[Figure: a sparse user-item rating matrix (users A-D, items W-Z, with observed ratings such as 4.5, 2.0, 4.0, 3.5, 5.0, 3.5, 4.0, 1.0) is factorized into a 4×2 user matrix (rows [1.2 0.8], [1.4 0.9], [1.5 1.0], [1.2 0.8]) times a 2×4 item matrix (rows [1.5 1.2 1.0 0.8], [1.7 0.6 1.1 0.4]).]
Model: Matrix Factorization of the User-Item Matrix
● Latent factors are the features in the lower-dimensional latent space projected from the user-item interaction matrix.
● Matrix factorization is one of the most effective dimensionality reduction techniques in machine learning.
Model: ALS
Alternating Least Squares (ALS) is also a matrix factorization algorithm, and it runs in a parallel fashion.
Model: ALS
● Solves the scalability and sparseness problems of the ratings data.
● It's simple and scales well to very large datasets.
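To make the "alternating" part concrete, here is a toy numpy sketch of ALS, not the MLlib implementation the project uses: fix the item factors and solve a regularized least-squares problem per user, then fix the user factors and solve per item. The ratings matrix below is hypothetical; 0 marks an unobserved entry.

```python
import numpy as np

R = np.array([[4.5, 0.0, 2.0],
              [4.0, 3.5, 0.0],
              [0.0, 5.0, 2.0]])
mask = R > 0                          # which ratings are observed
k, lam = 2, 0.1                       # latent factors, regularization
rng = np.random.default_rng(0)
U = rng.random((R.shape[0], k))       # user factor matrix
V = rng.random((R.shape[1], k))       # item factor matrix

for _ in range(20):
    for u in range(R.shape[0]):       # fix V, solve for each user vector
        obs = mask[u]
        A = V[obs].T @ V[obs] + lam * np.eye(k)
        U[u] = np.linalg.solve(A, V[obs].T @ R[u, obs])
    for i in range(R.shape[1]):       # fix U, solve for each item vector
        obs = mask[:, i]
        A = U[obs].T @ U[obs] + lam * np.eye(k)
        V[i] = np.linalg.solve(A, U[obs].T @ R[obs, i])

# Reconstruction error on the observed entries
train_mae = np.abs(U @ V.T - R)[mask].mean()
```

Each per-user and per-item solve is independent, which is exactly what lets Spark MLlib parallelize ALS across a cluster.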
Optimization: Grid Search to Find the Best Model Parameters

| Latent Factors | Regularization | Max Iterations | MAE |
|---|---|---|---|
| 50 | 0.1 | 10 | 0.7106074619 |
| 50 | 0.15 | 10 | 0.7042538296 |
| 50 | 0.2 | 10 | 0.7041747462 |
| 50 | 0.25 | 10 | 0.7087336317 |
| 100 | 0.1 | 10 | 0.7076845943 |
| 100 | 0.15 | 10 | 0.7036464454 |
| 100 | 0.2 | 10 | 0.7048896137 |
| 100 | 0.25 | 10 | 0.7097821267 |
| 150 | 0.1 | 10 | 0.7076660709 |
| 150 | 0.15 | 10 | 0.7035191506 | ← Best Parameters
| 150 | 0.2 | 10 | 0.704869857 |
| 150 | 0.25 | 10 | 0.7097177331 |
| 200 | 0.1 | 10 | 0.7077100207 |
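Selecting the best setting from a grid like this is an argmin over the (rank, regularization) pairs. A sketch using the deck's own MAE numbers (max iterations fixed at 10):

```python
# MAE values copied from the grid-search table above, keyed by
# (latent factors, regularization); max iterations = 10 throughout.
results = {
    (50, 0.1): 0.7106074619, (50, 0.15): 0.7042538296,
    (50, 0.2): 0.7041747462, (50, 0.25): 0.7087336317,
    (100, 0.1): 0.7076845943, (100, 0.15): 0.7036464454,
    (100, 0.2): 0.7048896137, (100, 0.25): 0.7097821267,
    (150, 0.1): 0.7076660709, (150, 0.15): 0.7035191506,
    (150, 0.2): 0.704869857, (150, 0.25): 0.7097177331,
    (200, 0.1): 0.7077100207,
}

best = min(results, key=results.get)  # -> (150, 0.15)
```

In practice each dictionary value would come from training an ALS model with those parameters and computing MAE on the test split.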
Optimization: Evaluation
[Diagram: the content-based model turns each user's history into a list of similar, "expected" tracks, which is then compared against the user's collaborative-filtering recommendations.]
Optimization: Evaluation Results
Old run:
● Train on 2019 data (→ 10M actions), test on 2020 data (→ 1M actions)
● Total users: 116,526 · Mean score: 0.01% · Max score: 40%
New run:
● Used all data (→ 70M actions), added the date coefficients, and found the best parameters
● Total users: 577,457 · Mean score: 1.1% · Max score: 50%
→ roughly a 110× improvement in mean score with the new setup.
Recommender Systems
● Content-Based
● Collaborative Filtering
  ○ Memory-Based
  ○ Model-Based
EDA: Tracks Collection

| TRACK_ID | TIME_CREATED | PRICE | PUBLISH_DATE | ALBUM_ID | duration | TYPE_KEY_CURATION | TYPE_KEY_GENRE | TAG_ID_379950851 | TAG_ID_379950852 |
|---|---|---|---|---|---|---|---|---|---|
| 6034074 | 12/21/13 18:12 | 9990 | 10/23/13 16:10 | 6032170 | 232 | 0 | 1 | 0 | 1 |
| 6036881 | 12/22/13 15:12 | 1990 | 10/23/13 16:10 | 6012439 | 249 | 0 | 1 | 0 | 0 |
| 6037213 | 12/22/13 17:12 | 0 | 10/23/13 16:10 | 2828262 | 192 | 1 | 1 | 0 | 0 |
| 6049227 | 12/25/13 15:12 | 0 | 3/21/13 16:03 | 6048970 | 203 | 0 | 1 | 0 | 0 |
| 6059662 | 12/28/13 15:12 | 8990 | 1/1/12 1:01 | 6059612 | 549 | 0 | 1 | 1 | 0 |
Model: Content-Based Filtering
● Item profile: for each track we construct a vector based on its features, such as its tags and artists.
● User profile: for each user we need a vector that captures their interests, based on ratings, or on likes and downloads.
Model: Content-Based Filtering / Item Profile
Item vector over features: Feature 1 = 0, Feature 2 = 1, Feature 3 = 1, Feature 4 = 0, Feature 5 = 0.
Model: Content-Based Filtering / User Profile
● The user has rated items with profiles i1, i2, i3, ..., in.
● One approach is a weighted average of the rated item profiles.
Model: Content-Based Filtering / User Profile
● Items are songs; the only feature is "tag".
● Item profile: a vector with 0 or 1 for each tag.
● Suppose user x has downloaded or liked 5 songs:
  ○ 2 songs featuring TAG A
  ○ 3 songs featuring TAG B
● User profile = mean of the item profiles:
  ○ Feature A's weight = 2/5 = 0.4
  ○ Feature B's weight = 3/5 = 0.6
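The worked example above can be reproduced in a few lines; this is a sketch, with the tag order [TAG A, TAG B] assumed:

```python
# Binary item profiles of the 5 songs user x liked or downloaded.
items = [
    [1, 0], [1, 0],            # 2 songs tagged TAG A
    [0, 1], [0, 1], [0, 1],    # 3 songs tagged TAG B
]

# User profile = column-wise mean of the item profiles.
n = len(items)
user_profile = [sum(col) / n for col in zip(*items)]  # -> [0.4, 0.6]
```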
Model: Content-Based Filtering / User Profile
User vector over tags: Tag A = 0.4, Tag B = 0.6, Tag C = 0, Tag D = 0, ...
Model: Content-Based Filtering / Cosine Similarity
● Estimate U(x, i) = cos(θ) = (x · i) / (|x| |i|)
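The utility estimate is plain cosine similarity between the user vector and an item vector; a minimal sketch, with the sample item vector assumed:

```python
from math import sqrt

def cosine(x, i):
    """cos(theta) = (x . i) / (|x| |i|) — utility U(x, i) of item i for user x."""
    dot = sum(a * b for a, b in zip(x, i))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in i)))

user = [0.4, 0.6, 0.0, 0.0]   # user vector from the previous slide
item = [0.0, 1.0, 0.0, 0.0]   # a hypothetical track tagged only with TAG B
score = cosine(user, item)
```

Ranking all candidate tracks by this score gives the user's content-based recommendation list.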
Pros
● No need for data on OTHER users
● Able to recommend to users with unique tastes
● Able to recommend new & unpopular items
  ○ No first-rater problem
● Explanations for recommended items
Cons
● Finding the appropriate features is hard
● Overspecialization
  ○ Never recommends items outside the user's content profile
Evaluation: Sample Track
Homayoun Shajarian & Alireza Ghorbani - "Afsane Chashmhayat"
Genre: Persian Traditional Music
Evaluation: Similar Tracks Found by the Content-Based Model
● Homayoun Shajarian - "Jana Be Negahi"
● Gholam-Hossein Banan - "Shart e Rah" (Ballad)
● Mohammad-Reza Shajarian - "Ah Baran"
● Alireza Eftekhari - "Asiri" (Ballad)
● Gholam-Hossein Banan - "Meykade Arezoo" (Ballad)
● Salar Aghili - "Daghe Jodaei"
● Homayoun Shajarian - "Be Tamashaye Negahat"
All results share the genre (Persian Traditional Music & Ballad), and some share an artist with the query track.

Spark Job
```shell
#!/bin/bash
HOME=/home/rc12g2
HADOOP_HOME=/user/rc12g2
source $HOME/.bashrc
echo "spark job -> started"
hadoop fs -rm -r -f $HADOOP_HOME/final-actions-v3
spark-submit $HOME/beeptunes_recsys/collaborative-filtering/final-action-aggregator.py > $HOME/log/spark.log
hadoop fs -getmerge $HADOOP_HOME/user_recs_v3 $HOME/result/user_recs_v3.csv
hadoop fs -rm -r -f $HADOOP_HOME/user_recs_v3
# Prepend the CSV header line before importing
sed -i '1i USER_ID,RECOMMENDATION_IDS' $HOME/result/user_recs_v3.csv
mongoimport --port 28018 -u 4max -p HFmk87Q3DgfEKKgC --type csv -d g2recsys -c user_rec --headerline --drop \
  --file $HOME/result/user_recs_v3.csv
echo "spark run -> complete"
```
Track Collection Update Job

```shell
#!/bin/bash
HOME=/home/rc12g2
echo "update_track_collection job -> started"
cd $HOME/beeptunes_recsys/jobs
mongoexport --port 28018 -u 4max -p HFmk87Q3DgfEKKgC --type csv -d g2recsys -c tracks --fieldFile tracks_fields.txt \
  --out ./export.csv > $HOME/log/update_track_collection.log
python update_trackCollection.py
mongo --port 28018 -u 4max -p HFmk87Q3DgfEKKgC g2recsys --eval 'db.tracks.drop()'
rm $HOME/beeptunes_recsys/jobs/export.csv
echo "update_track_collection job -> ended"
```
Crontab

```shell
rc12g2@BIHADOOP-Master1:~$ crontab -l
# m h dom mon dow command
0 2 1 * * /home/beeptunes_recsys/jobs/trackUpdate_job.sh
0 2 1 * * /home/beeptunes_recsys/jobs/spark_job.sh
```
Cold Start Problem?
Hybrid Model

/user/<user_id>/recommend
When a USER_ID enters the recommender system:
● User doesn't exist → Trends Collection**
● User exists with <16* actions → Content-Based
● User exists with >=16* actions → Collaborative Filtering

*25% of users have fewer than 16 actions.
**The top 30 tracks with the most actions in the last 3 months.
Hybrid Model

/track/<track_id>/similars
When a TRACK_ID enters the recommender system:
● Track exists → Content-Based → extract similar tracks
● Track doesn't exist → Trends Collection
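The user-endpoint routing can be sketched as a single function. Everything here is a hypothetical stand-in: `trends`, `content_based_recs`, and `cf_recs` represent the real collections/models, and the function name is ours.

```python
MIN_ACTIONS = 16  # per the slide, 25% of users fall below this threshold

def recommend(user_id, action_counts, trends, content_based_recs, cf_recs):
    """Route a user to trends, content-based, or collaborative filtering."""
    if user_id not in action_counts:          # unknown user -> cold start
        return trends                         # top tracks of the last 3 months
    if action_counts[user_id] < MIN_ACTIONS:  # too little history for CF
        return content_based_recs[user_id]
    return cf_recs[user_id]
```

Falling back to trends for unknown users and to the content-based model for thin histories is how the hybrid sidesteps the cold-start problem raised above.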
49
Niloufar
Farajpour
Mohamadreza
Fereydooni
Mohamadreza
Kiani
Tadeh
Alexani
Our Great Team!
50
Thank You!
51