Building a Music Recommender System from Scratch on the Beep Tunes Dataset
Niloufar Farajpour, Mohamadreza Kiani, Mohamadreza Fereydooni, Tadeh Alexani
Rahnema College - Winter 2020
"As a data scientist, question the results, because, often there is something you missed." (Frank Kane)
Dataset: BEEPTUNES.COM
Approaches
● Manual Curation by Experts
● Editorial Tagging
● Audio Signals
● Recommender Systems
Recommender Systems
● Content Based
● Collaborative Filtering
○ Memory Based
○ Model Based
Collaborative Filtering (CF)
● Memory Based: find similar users based on cosine similarity or Pearson correlation and take a weighted average of their ratings
○ Advantage: easy to create, and the results are explainable
○ Disadvantage: performance degrades when the data is sparse, so it does not scale
● Model Based: use machine learning to predict user ratings of unrated items, e.g. PCA, SVD, neural nets, matrix factorization
○ Advantage: dimensionality reduction deals with missing/sparse data
○ Disadvantage: inference is intractable because of hidden/latent factors
User-based vs Item-based
● User-based: users who are similar to you also liked …
● Item-based: users who liked this item also liked …
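The user-based idea ("users who are similar to you also liked …") can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the team's implementation: the function names `cosine` and `predict` and the toy ratings are hypothetical, and the similarity is computed over co-rated tracks only, which is one of several common variants.

```python
import math

def cosine(a, b):
    """Cosine similarity over the tracks both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[t] * b[t] for t in common)
    norm_a = math.sqrt(sum(a[t] ** 2 for t in common))
    norm_b = math.sqrt(sum(b[t] ** 2 for t in common))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def predict(ratings, user, track):
    """Similarity-weighted average of other users' ratings for `track`."""
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or track not in other_ratings:
            continue
        sim = cosine(ratings[user], other_ratings)
        num += sim * other_ratings[track]
        den += abs(sim)
    return num / den if den else None

# Hypothetical toy ratings (user -> {track: rating}); not Beep Tunes data.
ratings = {
    "u1": {"t1": 5.0, "t2": 3.0},
    "u2": {"t1": 5.0, "t2": 3.0, "t3": 4.0},
    "u3": {"t1": 1.0, "t2": 5.0, "t3": 2.0},
}
prediction = predict(ratings, "u1", "t3")
print(prediction)
```

Because u2 rates the shared tracks exactly like u1, its rating of t3 dominates the weighted average.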
EDA
● Given Dataset:
○ album_like.csv
○ album_track_purchase.csv
○ album_artist.info
○ download_album.csv
○ artist_like.csv
○ track_download.csv
○ track_like.csv
○ track_info.csv
○ track_tag.csv
○ track_artist.csv
EDA
[Figure: downloaded tracks, purchased tracks, liked tracks]
EDA: ActionCollection
USER_ID    TRACK_ID     C_DATE               PUBLISH_DATE         LIKED  DOWNLOAD  PURCHASE
8681411    553605400.0  2019-11-19 11:11:29  2019-10-23 00:10:13  0.0    1.0       0.0
46170847   550442677.0  2019-06-19 09:06:27  2018-05-22 09:05:20  0.0    1.0       0.0
469026688  486718552.0  2018-06-29 23:06:07  2017-07-23 16:07:31  0.0    1.0       0.0
6644142    6058526.0    2014-03-14 20:03:04  2013-03-21 16:03:15  0.0    0.5       0.0
511266472  509711880.0  2018-09-19 20:09:36  2018-02-20 12:02:32  0.0    0.5       0.0
EDA
Impression = (Like * 4) + (Download * 2) + (Purchase * 1)
The weights are based on the inverse of the frequency of likes, downloads, and purchases among all actions.
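As a small illustration, the impression formula above can be written as a function. The name `impression` and the `WEIGHTS` dictionary are hypothetical; only the weights 4, 2, and 1 come from the slide.

```python
# Weights from the slide: like = 4, download = 2, purchase = 1.
WEIGHTS = {"like": 4, "download": 2, "purchase": 1}

def impression(like, download, purchase):
    """Combine the three action signals of one (user, track) row into one score."""
    return (WEIGHTS["like"] * like
            + WEIGHTS["download"] * download
            + WEIGHTS["purchase"] * purchase)

# LIKED / DOWNLOAD / PURCHASE values as they appear in ActionCollection rows.
print(impression(0.0, 1.0, 0.0))
print(impression(0.0, 0.5, 0.0))
```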
EDA
[Figure: downloaded tracks, purchased tracks, likes]
USER_ID    TRACK_ID     C_DATE               PUBLISH_DATE         LIKED  DOWNLOAD  PURCHASE  OCCURED_AFTER
8681411    553605400.0  2019-11-19 11:11:29  2019-10-23 00:10:13  0.0    1.0       0.0       27 days 11:01:16
46170847   550442677.0  2019-06-19 09:06:27  2018-05-22 09:05:20  0.0    1.0       0.0       393 days 00:01:07
469026688  486718552.0  2018-06-29 23:06:07  2017-07-23 16:07:31  0.0    1.0       0.0       341 days 06:58:36
6644142    6058526.0    2014-03-14 20:03:04  2013-03-21 16:03:15  0.0    0.5       0.0       358 days 03:59:49
511266472  509711880.0  2018-09-19 20:09:36  2018-02-20 12:02:32  0.0    0.5       0.0       211 days 08:07:04
EDA
We also considered two time parameters.
● Action-Publish (time between the track's publish date and the action):
○ If it is less than 10 days: 1.5
○ If it is more than 10 days: 1
● Action-Today:
1 + (Action Year - 2011) / 10 + Action Month * 0.025
Example: 2016-5: 1 + (2016 - 2011) / 10 + (5 * 0.025) = 1.625
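The two time parameters can be sketched as follows. The function names are hypothetical, and the slide does not say what happens at exactly 10 days, so this sketch assumes a coefficient of 1 in that case. How the coefficients are combined with the impression score is not stated in the slides, so it is left out here.

```python
from datetime import datetime, timedelta

def action_publish_coeff(action_time, publish_time):
    """1.5 if the action happened less than 10 days after publication, else 1.
    Exactly 10 days is unspecified on the slide; assumed to give 1."""
    return 1.5 if action_time - publish_time < timedelta(days=10) else 1.0

def action_today_coeff(action_time):
    """1 + (Action Year - 2011) / 10 + Action Month * 0.025, as on the slide."""
    return 1 + (action_time.year - 2011) / 10 + action_time.month * 0.025

# Slide example: an action in May 2016.
print(action_today_coeff(datetime(2016, 5, 1)))

# First ActionCollection row: the action came about 27 days after publication.
print(action_publish_coeff(datetime(2019, 11, 19, 11, 11, 29),
                           datetime(2019, 10, 23, 0, 10, 13)))
```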
EDA
USER_ID    TRACK_ID   C_DATE               LIKED  DOWNLOAD  PURCHASE  Action-Today  Action-Publish
11128390   6625384    2014-09-23 20:09:18  0      0.5       0         1.375         1
10481467   7014550    2014-08-01 12:08:21  0      0.5       0         1.35          1
38359414   2832153    2015-06-27 17:06:34  0      0         1         1.45          1
102312590  513326911  2018-05-30 19:05:49  0      0.5       0         1.75          1.5
12904873   2855686    2016-06-04 15:06:39  0      1         0         1.55          1
Evaluation: Outline of Online and Offline
[Figure: two evaluation pipelines. Online: Dataset for Recommendation → Model → Recommendation List → Evaluate. Offline: Dataset for Recommendation → Model → Recommendation List → Compare with Correct Data. Additional label: Impression.]
Evaluation
● Split into train/test data by date (e.g. by year)
○ From 70,000,000 action records from 2011 to 2020:
■ Train data -> 2019 -> ~10,000,000 actions
■ Test data -> 2020 -> ~1,000,000 actions
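A date-based split like the one above can be sketched in plain Python. The function name `split_by_year` and the dictionary layout are hypothetical; the first three rows reuse USER_ID and C_DATE values from the ActionCollection table, and the 2020 row is invented for illustration.

```python
from datetime import datetime

def split_by_year(actions, train_year, test_year):
    """Keep actions from `train_year` for training and `test_year` for testing."""
    train = [a for a in actions if a["C_DATE"].year == train_year]
    test = [a for a in actions if a["C_DATE"].year == test_year]
    return train, test

actions = [
    {"USER_ID": 8681411, "C_DATE": datetime(2019, 11, 19, 11, 11, 29)},
    {"USER_ID": 46170847, "C_DATE": datetime(2019, 6, 19, 9, 6, 27)},
    {"USER_ID": 469026688, "C_DATE": datetime(2018, 6, 29, 23, 6, 7)},
    {"USER_ID": 1, "C_DATE": datetime(2020, 1, 15, 8, 0, 0)},  # hypothetical row
]
train, test = split_by_year(actions, train_year=2019, test_year=2020)
print(len(train), len(test))
```

Splitting by time rather than at random keeps every test action later than every training action, which mirrors how the model would be used.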
Evaluation
User_ID    Track_ID  Impression  Duration(Normal)  Price(Normal)
46170847   6034074   2.7         1.665000          1.160
469026688  6036881   1.3         0.331667          1.245
Implicit
Explicit
Evaluation
● Evaluate the model by computing the Mean Absolute Error (MAE) on the test data
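For reference, MAE is the average of |actual - predicted| over the test rows. The sketch below computes it in plain Python; the function name and the sample predictions are hypothetical. In Spark, the `RegressionEvaluator` class in `pyspark.ml.evaluation` can compute the same metric with `metricName="mae"`.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted scores."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

# Hypothetical values: actual impressions vs. model predictions.
actual = [2.7, 1.3]
predicted = [2.2, 1.8]
print(mean_absolute_error(actual, predicted))
```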
Evaluation: RegressionEvaluator (pyspark.ml.evaluation)
Model: Memory Based CF
● Compute a correlation score for every column pair in the matrix
● This gives us a correlation score between every pair of tracks
● Problems: takes too long to compute, suffers from sparseness, and does not scale
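The "correlation score for every column pair" step can be sketched as below, assuming Pearson correlation over the users who interacted with both tracks. The names `pearson` and `track_correlations` and the toy matrix are hypothetical. With n tracks there are n(n-1)/2 pairs, which is why this approach becomes too slow on a large catalogue.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation of two equal-length lists, or None if undefined."""
    n = len(x)
    if n < 2:
        return None
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return None
    return cov / (sx * sy)

def track_correlations(matrix):
    """matrix: user -> {track: score}. Returns {(t1, t2): r} for every track pair."""
    tracks = sorted({t for row in matrix.values() for t in row})
    result = {}
    for t1, t2 in combinations(tracks, 2):
        pairs = [(row[t1], row[t2]) for row in matrix.values()
                 if t1 in row and t2 in row]
        r = pearson([p[0] for p in pairs], [p[1] for p in pairs])
        if r is not None:
            result[(t1, t2)] = r
    return result

# Hypothetical user-track scores.
matrix = {
    "u1": {"t1": 1.0, "t2": 2.0, "t3": 5.0},
    "u2": {"t1": 2.0, "t2": 4.0, "t3": 3.0},
    "u3": {"t1": 3.0, "t2": 6.0, "t3": 1.0},
}
correlations = track_correlations(matrix)
print(correlations)
```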
Model: Model Based CF
MLlib
● classification: logistic regression, linear SVM, naive Bayes
● regression
● clustering: k-means
● collaborative filtering: alternating least squares (ALS)
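Model: Matrix Factorization of User-Item Matrix
[Figure: a sparse user-item rating matrix (users A-D as rows, items W-Z as columns) written as User Matrix x Item Matrix]
Rating matrix, observed values by row (empty cells omitted):
4.5 2.0
4.0 3.5
5.0 2.0
3.5 4.0 1.0
User Matrix (rows A-D):
1.2 0.8
1.4 0.9
1.5 1.0
1.2 0.8
Item Matrix (columns W-Z):
1.5 1.2 1.0 0.8
1.7 0.6 1.1 0.4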
Model: Matrix Factorization of User-Item Matrix
● Latent factors are the features of the lower-dimensional latent space onto which the user-item interaction matrix is projected.
● Matrix factorization is a very effective dimensionality reduction technique in machine learning.
Model: ALS
Alternating Least Squares (ALS) is also a matrix factorization algorithm, and it runs in parallel.
Model: ALS
● Solves the scalability and sparseness problems of the ratings data
● It's simple and scales well to very large datasets
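To make the "alternating" part concrete, here is a minimal NumPy sketch of ALS on a tiny dense array with a mask of observed cells. It is not Spark's implementation (Spark MLlib distributes the work and scales the regularization differently); the function names, the toy matrix, and the hyperparameter values are all hypothetical. Each half-step fixes one factor matrix and solves a small ridge regression per row of the other, so the regularized squared error never increases.

```python
import numpy as np

def objective(R, M, U, V, reg):
    """Squared error on observed cells plus L2 penalty on both factor matrices."""
    err = (R - U @ V.T)[M]
    return float(np.sum(err ** 2) + reg * (np.sum(U ** 2) + np.sum(V ** 2)))

def als(R, M, rank=2, reg=0.1, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    n_users, n_items = R.shape
    U = rng.normal(scale=0.1, size=(n_users, rank))  # user factors
    V = rng.normal(scale=0.1, size=(n_items, rank))  # item factors
    ridge = reg * np.eye(rank)
    history = [objective(R, M, U, V, reg)]
    for _ in range(iters):
        # Fix V, solve a ridge regression for each user's factors.
        for u in range(n_users):
            Vo = V[M[u]]
            U[u] = np.linalg.solve(Vo.T @ Vo + ridge, Vo.T @ R[u, M[u]])
        # Fix U, solve a ridge regression for each item's factors.
        for i in range(n_items):
            Uo = U[M[:, i]]
            V[i] = np.linalg.solve(Uo.T @ Uo + ridge, Uo.T @ R[M[:, i], i])
        history.append(objective(R, M, U, V, reg))
    return U, V, history

# Hypothetical 4x4 rating matrix; 0 marks an unobserved cell.
R = np.array([[5.0, 0.0, 2.0, 0.0],
              [4.0, 0.0, 0.0, 3.0],
              [0.0, 5.0, 0.0, 2.0],
              [0.0, 3.0, 4.0, 1.0]])
M = R > 0
U, V, history = als(R, M)
print(history[0], history[-1])
print(np.round(U @ V.T, 2))  # predictions, including the unobserved cells
```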
Optimization: Grid Search to Find Model Best Parameters
Latent Factors  Regularization  Max Iterations  MAE
50              0.1             10              0.7106074619
50              0.15            10              0.7042538296
50              0.2             10              0.7041747462
50              0.25            10              0.7087336317
100             0.1             10              0.7076845943
100             0.15            10              0.7036464454
100             0.2             10              0.7048896137
100             0.25            10              0.7097821267
150             0.1             10              0.7076660709
150             0.15            10              0.7035191506
150             0.2             10              0.704869857
150             0.25            10              0.7097177331
200             0.1             10              0.7077100207
→ Best Parameters (lowest MAE among the rows shown: 150 latent factors, regularization 0.15, 10 iterations)
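A grid search simply evaluates every parameter combination and keeps the one with the lowest error. In the sketch below, `grid_search` and `lookup` are hypothetical names; `lookup` stands in for "train the model and compute MAE" by returning the values reported in the table above, and combinations that the table does not list are treated as not evaluated.

```python
from itertools import product

def grid_search(evaluate, ranks, reg_params, max_iters):
    """Return ((rank, reg, iters), mae) for the combination with the lowest MAE."""
    best = None
    for rank, reg, iters in product(ranks, reg_params, max_iters):
        mae = evaluate(rank, reg, iters)
        if best is None or mae < best[1]:
            best = ((rank, reg, iters), mae)
    return best

# MAE values reported in the table, keyed by
# (latent factors, regularization, max iterations).
REPORTED = {
    (50, 0.1, 10): 0.7106074619, (50, 0.15, 10): 0.7042538296,
    (50, 0.2, 10): 0.7041747462, (50, 0.25, 10): 0.7087336317,
    (100, 0.1, 10): 0.7076845943, (100, 0.15, 10): 0.7036464454,
    (100, 0.2, 10): 0.7048896137, (100, 0.25, 10): 0.7097821267,
    (150, 0.1, 10): 0.7076660709, (150, 0.15, 10): 0.7035191506,
    (150, 0.2, 10): 0.704869857, (150, 0.25, 10): 0.7097177331,
    (200, 0.1, 10): 0.7077100207,
}

def lookup(rank, reg, iters):
    # Stand-in for training and evaluating a model; unlisted combinations
    # are treated as not evaluated.
    return REPORTED.get((rank, reg, iters), float("inf"))

best_params, best_mae = grid_search(lookup, [50, 100, 150, 200],
                                    [0.1, 0.15, 0.2, 0.25], [10])
print(best_params, best_mae)
```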
Optimization: Evaluation
[Figure: evaluation flow with the labels Content-Based Model, Similar Tracks, Collaborative Filtering, User History, Expected Tracks, User Recs, Compare]
Optimization: Evaluation Result
● Using 2019 data as train → 10M
● Using 2020 data as test → 1M
● Running on the Old Result:
○ Total Users: 116526
○ Mean Score: 0.01%
○ Max Score: 40%
● Using all data → 70M
● Add date coefficients
● Finding Best Parameters
● Running on the New Result:
○ Total Users: 577457
○ Mean Score: 1.1%
○ Max Score: 50%
110x improvement in mean score (0.01% → 1.1%) with the new data
Recommender Systems
● Content Based
● Collaborative Filtering
○ Memory Based
○ Model Based
EDA: Tracks Collection
TRACK_ID  TIME_CREATED    PRICE  PUBLISH_DATE    ALBUM_ID  duration  TYPE_KEY_CURATION  TYPE_KEY_GENRE  TAG_ID_379950851  TAG_ID_379950852
6034074   12/21/13 18:12  9990   10/23/13 16:10  6032170   232       0                  1               0                 1
6036881   12/22/13 15:12  1990   10/23/13 16:10  6012439   249       0                  1               0                 0
6037213   12/22/13 17:12  0      10/23/13 16:10  2828262   192       1                  1               0                 0
6049227   12/25/13 15:12  0      3/21/13 16:03   6048970   203       0                  1               0                 0
6059662   12/28/13 15:12  8990   1/1/12 1:01     6059612   549       0                  1               1                 0
Model: Content-Based Filtering
● Item profile: for each track, construct a vector based on its features, such as its tags and artists
● User profile: for each user, build a vector that reflects their interests, based on ratings, likes, and downloads
Model: Content-Based Filtering/Item Profile
Item vector:
Feature 1: 0
Feature 2: 1
Feature 3: 1
Feature 4: 0
Feature 5: 0
Model: Content-Based Filtering/User profile
● User has rated items with profiles i1, i2, i3, ..., in
● One approach is a weighted average of the rated item profiles
Model: Content-Based Filtering/User profile
● Items are songs; the only feature is "tag"
● Item profile: vector with 0 or 1 for each tag
● Suppose user x has downloaded or liked 5 songs
○ 2 songs featuring TAG A
○ 3 songs featuring TAG B
● User profile = mean of item profiles
○ Feature A's weight = 2/5 = 0.4
○ Feature B's weight = 3/5 = 0.6
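The worked example above can be reproduced with a few lines of Python. The function name `user_profile` and the four-tag layout (A, B, C, D) are assumptions made for illustration.

```python
def user_profile(item_profiles):
    """Mean of the 0/1 item profile vectors, one weight per tag."""
    n = len(item_profiles)
    dims = len(item_profiles[0])
    return [sum(p[d] for p in item_profiles) / n for d in range(dims)]

TAGS = ["A", "B", "C", "D"]
# Five songs the user downloaded or liked: 2 with tag A, 3 with tag B.
songs = [[1, 0, 0, 0]] * 2 + [[0, 1, 0, 0]] * 3
profile = user_profile(songs)
print(dict(zip(TAGS, profile)))
```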
Model: Content-Based Filtering/User profile
User vector:
Tag A: 0.4
Tag B: 0.6
Tag C: 0
Tag D: 0
...
Model: Content-Based Filtering/Cosine similarity
● Estimate U(x, i) = cos(θ) = (x · i) / (|x| |i|), where θ is the angle between the user profile x and the item profile i
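A minimal sketch of scoring tracks with this cosine estimate follows. The user vector is the one from the tag example (0.4 for Tag A, 0.6 for Tag B); the three item profiles and their names are hypothetical.

```python
import math

def cosine(x, i):
    """cos(theta) = (x . i) / (|x| |i|); returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, i))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_i = math.sqrt(sum(b * b for b in i))
    return dot / (norm_x * norm_i) if norm_x and norm_i else 0.0

user = [0.4, 0.6, 0.0, 0.0]  # weights for tags A, B, C, D
items = {                     # hypothetical item profiles
    "track_1": [1, 0, 0, 0],
    "track_2": [0, 1, 0, 0],
    "track_3": [0, 0, 1, 0],
}
ranked = sorted(items, key=lambda t: cosine(user, items[t]), reverse=True)
print(ranked)
```

Tracks whose tags the user never interacted with get a score of 0, which illustrates the overspecialization problem of content-based filtering.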
Pros
● No need for data on OTHER users
● Able to recommend to users with unique tastes
● Able to recommend new & unpopular items
○ No first-rater problem
● Explanations for recommended items
Cons
● Finding the appropriate features is hard
● Overspecialization
○ Never recommends items outside the user's content profile
Evaluation: Sample Track
Homayoun Shajarian & Alireza Ghorbani
“Afsane Chashmhayat”
Genre: Persian Traditional Music
Evaluation: Similar Tracks Found by Content-Based Model
● Homayoun Shajarian - “Jana Be Negahi”
● Gholam-Hossein Banan - “Shart e Rah” Ballad
● Mohammad-Reza Shajarian - “Ah Baran”
● Alireza Eftekhari - “Asiri” Ballad
● Gholam-Hossein Banan - “Meykade Arezoo” Ballad
● Salar Aghili - “Daghe Jodaei”
● Homayoun Shajarian - “Be Tamashaye Negahat”
Genre: Persian Traditional Music & Ballad + Same Artists in Some Items
Spark Job
#!/bin/bash
HOME=/home/rc12g2
HADOOP_HOME=/user/rc12g2
source $HOME/.bashrc
echo "spark job -> started"
# Remove any previous final-actions-v3 output from HDFS before the new run
hadoop fs -rm -r -f $HADOOP_HOME/final-actions-v3
spark-submit $HOME/beeptunes_recsys/collaborative-filtering/final-action-aggregator.py > $HOME/log/spark.log
# Merge the HDFS output parts into one local CSV, then remove them from HDFS
hadoop fs -getmerge $HADOOP_HOME/user_recs_v3 $HOME/result/user_recs_v3.csv
hadoop fs -rm -r -f $HADOOP_HOME/user_recs_v3
# Insert the header row that mongoimport --headerline expects (GNU sed "1i")
sed -i '1i USER_ID,RECOMMENDATION_IDS' $HOME/result/user_recs_v3.csv
mongoimport --port 28018 -u 4max -p HFmk87Q3DgfEKKgC --type csv -d g2recsys -c user_rec --headerline --drop $HOME/result/user_recs_v3.csv
echo "spark run -> complete"
Track Collection Update Job
#!/bin/bash
HOME=/home/rc12g2
echo "update_track_collection job -> started"
cd $HOME/beeptunes_recsys/jobs
# Export the tracks collection to CSV, using the field list in tracks_fields.txt
mongoexport --port 28018 -u 4max -p HFmk87Q3DgfEKKgC --type csv -d g2recsys -c tracks --fieldFile tracks_fields.txt --out ./export.csv > $HOME/log/update_track_collection.log
python update_trackCollection.py
# Drop the tracks collection, then remove the temporary export file
mongo --port 28018 -u 4max -p HFmk87Q3DgfEKKgC g2recsys --eval 'db.tracks.drop()'
rm $HOME/beeptunes_recsys/jobs/export.csv
echo "update_track_collection job -> ended"
Crontab
rc12g2@BIHADOOP-Master1:~$ crontab -l
# m h dom mon dow command
0 2 1 * * /home/beeptunes_recsys/jobs/trackUpdate_job.sh
0 2 1 * * /home/beeptunes_recsys/jobs/spark_job.sh
Cold Start Problem?
Hybrid Model
Endpoint: /user/<user_id>/recommend
[Figure: decision flow. A USER_ID is entered into the recommender system, which branches on "User Exists" / "User Doesn't Exist" and on "<16* actions" / ">=16* actions" to choose among the Trends** Collection, Collaborative Filtering, and Content-Based.]
* 25% of users have fewer than 16 actions
** Top 30 tracks with the most actions in the last 3 months
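The routing for the user endpoint might look like the sketch below. The slide text names the branches and the 16-action threshold but does not preserve which branch leads to which model, so the mapping here is an assumption: unknown users get the trends collection, users with fewer than 16 actions get content-based recommendations, and the rest get collaborative filtering. The function and constant names are hypothetical.

```python
MIN_ACTIONS_FOR_CF = 16  # threshold from the slide footnote

def choose_recommender(user_exists, action_count):
    """Pick a recommendation source for a user (branch mapping is assumed)."""
    if not user_exists:
        return "trends"                   # assumption: cold-start fallback
    if action_count < MIN_ACTIONS_FOR_CF:
        return "content_based"            # assumption: too little history for CF
    return "collaborative_filtering"      # assumption: enough history for CF

print(choose_recommender(False, 0))
print(choose_recommender(True, 3))
print(choose_recommender(True, 40))
```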
Hybrid Model
Endpoint: /track/<track_id>/similars
[Figure: decision flow. A TRACK_ID is entered into the recommender system, which branches on "Track Exists" / "Track Doesn't Exist" between Content-Based (Extract Similars) and the Trends Collection.]
Our Great Team!
Niloufar Farajpour, Mohamadreza Fereydooni, Mohamadreza Kiani, Tadeh Alexani
Thank You!

BeepTunes Music Recommender System

  • 1. Building a Music Recommender System from Scratch on the Beep Tunes Dataset Niloufar Farajpour, Mohamadreza Kiani, Mohamadreza Fereydooni, Tadeh Alexani 1 Rahnema College - Winter 2020
  • 2. Frank Kane As a data scientist, question the results, because, often there is something you missed. 2
  • 4. Approaches ● Manual Curation by Experts ● Editorial Tagging ● Audio Signals ● Recommender Systems 4
  • 5. 5 Recommender Systems Content Based Collaborative Filtering Memory BasedModel Based
  • 6. Collaborative Filtering (CF) ● Memory Based: find similar users based on cosine similarity or Pearson correlation and take a weighted average of their ratings ○ Advantage: easy to build, and the results are explainable ○ Disadvantage: performance degrades when the data is sparse, so it does not scale ● Model Based: use machine learning to predict user ratings of unrated items, e.g. PCA, SVD, neural nets, matrix factorization ○ Advantage: dimensionality reduction deals with missing/sparse data ○ Disadvantage: inference is intractable because of hidden/latent factors
  • 7. User-based vs Item-based ● User-based: "Users who are similar to you also liked …" ● Item-based: "Users who liked this item also liked …"
  • 8. EDA ● Given Dataset: ○ album_like.csv ○ album_track_purchase.csv ○ album_artist.info ○ download_album.csv ○ artist_like.csv ○ track_download.csv ○ track_like.csv ○ track_info.csv ○ track_tag.csv ○ track_artist.csv
  • 10. EDA: Action Collection
      USER_ID    TRACK_ID     C_DATE               PUBLISH_DATE         LIKED  DOWNLOAD  PURCHASE
      8681411    553605400.0  2019-11-19 11:11:29  2019-10-23 00:10:13  0.0    1.0       0.0
      46170847   550442677.0  2019-06-19 09:06:27  2018-05-22 09:05:20  0.0    1.0       0.0
      469026688  486718552.0  2018-06-29 23:06:07  2017-07-23 16:07:31  0.0    1.0       0.0
      6644142    6058526.0    2014-03-14 20:03:04  2013-03-21 16:03:15  0.0    0.5       0.0
      511266472  509711880.0  2018-09-19 20:09:36  2018-02-20 12:02:32  0.0    0.5       0.0
  • 11. EDA: Impression = (Like*4) + (Download*2) + (Purchase*1). The weights are based on the inverse of the frequency of likes, downloads, and purchases among all actions.
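The impression formula on slide 11 can be sketched in a few lines of Python. This is a minimal illustration, not the project's code: the function and constant names are hypothetical, and only the weights 4, 2, and 1 come from the slide.

```python
# Weights from slide 11 (like = 4, download = 2, purchase = 1).
LIKE_WEIGHT = 4
DOWNLOAD_WEIGHT = 2
PURCHASE_WEIGHT = 1


def impression(liked: float, download: float, purchase: float) -> float:
    """Combine the three action columns of one (user, track) row into one score."""
    return (liked * LIKE_WEIGHT
            + download * DOWNLOAD_WEIGHT
            + purchase * PURCHASE_WEIGHT)


# Rows shaped like the action collection: (LIKED, DOWNLOAD, PURCHASE).
print(impression(0.0, 1.0, 0.0))  # download only
print(impression(0.0, 0.5, 0.0))  # fractional DOWNLOAD value, as in the table
print(impression(1.0, 1.0, 1.0))  # all three actions
```

Because the score is a plain weighted sum, fractional column values such as DOWNLOAD = 0.5 simply contribute half of their weight.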
  • 12. EDA (sources: Downloaded Tracks, Purchased Tracks, Likes)
      USER_ID    TRACK_ID     C_DATE               PUBLISH_DATE         LIKED  DOWNLOAD  PURCHASE  OCCURED_AFTER
      8681411    553605400.0  2019-11-19 11:11:29  2019-10-23 00:10:13  0.0    1.0       0.0       27 days 11:01:16
      46170847   550442677.0  2019-06-19 09:06:27  2018-05-22 09:05:20  0.0    1.0       0.0       393 days 00:01:07
      469026688  486718552.0  2018-06-29 23:06:07  2017-07-23 16:07:31  0.0    1.0       0.0       341 days 06:58:36
      6644142    6058526.0    2014-03-14 20:03:04  2013-03-21 16:03:15  0.0    0.5       0.0       358 days 03:59:49
      511266472  509711880.0  2018-09-19 20:09:36  2018-02-20 12:02:32  0.0    0.5       0.0       211 days 08:07:04
  • 13. EDA: We also considered two time parameters. ● Action-Publish (time between the publish date and the action): ○ If it is less than 10 days: 1.5 ○ If it is more than 10 days: 1 ● Action-Today: 1 + (Action Year - 2011) / 10 + Action Month * 0.025 ○ Example: 2016-5: 1 + (2016 - 2011)/10 + (5 * 0.025) = 1.625
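The two time parameters from slide 13 can be written as small Python functions. This is a sketch under stated assumptions: the function names are hypothetical, and the slide does not say what happens at exactly 10 days, so the code treats that case as "more than 10 days".

```python
from datetime import datetime, timedelta

BASE_YEAR = 2011  # first year of the action records, per the slide's formula


def action_publish_coefficient(action_time: datetime, publish_time: datetime) -> float:
    """1.5 for actions within 10 days of publication, otherwise 1.

    Assumption: a gap of exactly 10 days falls in the "otherwise" branch.
    """
    return 1.5 if action_time - publish_time < timedelta(days=10) else 1.0


def action_today_coefficient(action_time: datetime) -> float:
    """1 + (Action Year - 2011) / 10 + Action Month * 0.025, as on slide 13."""
    return 1 + (action_time.year - BASE_YEAR) / 10 + action_time.month * 0.025


# The slide's own example: an action in May 2016.
print(action_today_coefficient(datetime(2016, 5, 1)))

# A download 27 days after publication (first row of the action table).
print(action_publish_coefficient(datetime(2019, 11, 19, 11, 11, 29),
                                 datetime(2019, 10, 23, 0, 10, 13)))
```

Both coefficients grow with recency, so newer actions and actions on freshly published tracks weigh more.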
  • 14. EDA
      USER_ID    TRACK_ID   C_DATE               LIKED  DOWNLOAD  PURCHASE  Action-Today  Action-Publish
      11128390   6625384    2014-09-23 20:09:18  0      0.5       0         1.375         1
      10481467   7014550    2014-08-01 12:08:21  0      0.5       0         1.35          1
      38359414   2832153    2015-06-27 17:06:34  0      0         1         1.45          1
      102312590  513326911  2018-05-30 19:05:49  0      0.5       0         1.75          1.5
      12904873   2855686    2016-06-04 15:06:39  0      1         0         1.55          1
  • 15. Evaluation: Outline of Online and Offline Impression [Diagram with two panels, Online and Offline: in each, a model produces a recommendation list (1- Track A, 2- Track B, 3- Track C) from the dataset for recommendation; one panel evaluates the list, the other compares it with correct data.]
  • 16. Evaluation ● Split into train/test data by date (e.g. by year) ○ Out of 70,000,000 action records from 2011 to 2020: ■ Train data: 2019, ~10,000,000 actions ■ Test data: 2020, ~1,000,000 actions
  • 17. Evaluation (slide labels: Implicit, Explicit)
      User_ID    Track_ID  Impression  Duration(Normal)  Price(Normal)
      46170847   6034074   2.7         1.665000          1.160
      469026688  6036881   1.3         0.331667          1.245
  • 18. Evaluation ● Evaluate the model by computing the Mean Absolute Error (MAE) on the test data
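MAE is the average of the absolute differences between the predicted and the actual impression scores. A minimal pure-Python sketch follows; the function name and the sample values are hypothetical and are not results from the deck.

```python
def mean_absolute_error(actual, predicted):
    """Average of |actual - predicted| over all test rows."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up impression scores for three (user, track) pairs in a test set.
actual_scores = [2.7, 1.3, 2.0]
predicted_scores = [2.0, 1.5, 2.5]
print(mean_absolute_error(actual_scores, predicted_scores))
```

MAE is expressed in the same units as the impression score, which makes values from different runs directly comparable.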
  • 20. Model: Memory Based CF ● Compute a correlation score for every column pair in the matrix ● This gives us a correlation score between every pair of tracks ● Too long to compute ● Sparseness ● Scalability
  • 21. Model: Model Based CF with MLlib ● classification: logistic regression, linear SVM, naive Bayes ● regression ● clustering: k-means ● collaborative filtering: alternating least squares (ALS)
  • 23. Model: Matrix Factorization of User-Item Matrix [Figure: a sparse User-Item rating matrix, with rows and columns labelled A, B, C, D and W, X, Y, Z, is approximated as the product of a User Matrix and an Item Matrix.]
  • 24. Model: Matrix Factorization of User-Item Matrix ● Latent factors are the features of the lower-dimensional latent space onto which the user-item interaction matrix is projected. ● Matrix factorization is one of the most effective dimensionality reduction techniques in machine learning.
  • 25. Model: ALS ● Alternating Least Squares (ALS) is also a matrix factorization algorithm, and it runs in parallel.
  • 26. Model: ALS ● Addresses the scalability and sparseness problems of the ratings data ● It's simple and scales well to very large datasets
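To make the "alternating" part of ALS concrete, here is a toy pure-Python sketch with a single latent factor. It is not the Spark MLlib implementation the deck uses, which works with many factors and distributes the work; the function names and rating values below are hypothetical. With the item factors fixed, each user's factor has a closed-form regularized least-squares solution, and vice versa, so the two updates are repeated in turn.

```python
def objective(ratings, u, v, reg):
    """Squared error on observed ratings plus L2 penalty on both factor lists."""
    err = sum((r - u[a] * v[b]) ** 2 for (a, b), r in ratings.items())
    return err + reg * (sum(x * x for x in u) + sum(x * x for x in v))


def als_rank1(ratings, n_users, n_items, reg=0.1, iterations=10):
    """ratings maps (user_index, item_index) -> value for observed entries only."""
    u = [1.0] * n_users
    v = [1.0] * n_items
    for _ in range(iterations):
        # Fix item factors; solve a 1-D ridge regression for every user.
        for a in range(n_users):
            obs = [(b, r) for (x, b), r in ratings.items() if x == a]
            u[a] = sum(r * v[b] for b, r in obs) / (sum(v[b] ** 2 for b, _ in obs) + reg)
        # Fix user factors; solve the same problem for every item.
        for b in range(n_items):
            obs = [(a, r) for (a, y), r in ratings.items() if y == b]
            v[b] = sum(r * u[a] for a, r in obs) / (sum(u[a] ** 2 for a, _ in obs) + reg)
    return u, v


# Toy sparse matrix: 4 users x 4 items, 9 observed entries (made-up values).
toy_ratings = {(0, 0): 4.5, (0, 2): 2.0, (1, 1): 4.0, (1, 3): 3.5, (2, 0): 5.0,
               (2, 2): 2.0, (3, 1): 3.5, (3, 2): 4.0, (3, 3): 1.0}
user_factors, item_factors = als_rank1(toy_ratings, n_users=4, n_items=4)
print(objective(toy_ratings, [1.0] * 4, [1.0] * 4, 0.1))        # before training
print(objective(toy_ratings, user_factors, item_factors, 0.1))  # after training
# Predicted score for an unobserved (user 0, item 1) pair.
print(user_factors[0] * item_factors[1])
```

Only observed entries enter the sums, which is how the method copes with a sparse matrix, and each user's or item's update is independent of the others, which is what makes the algorithm easy to parallelize.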
  • 27. Optimization: Grid Search to Find the Best Model Parameters
      Latent Factors  Regularization  Max Iterations  MAE
      50              0.1             10              0.7106074619
      50              0.15            10              0.7042538296
      50              0.2             10              0.7041747462
      50              0.25            10              0.7087336317
      100             0.1             10              0.7076845943
      100             0.15            10              0.7036464454
      100             0.2             10              0.7048896137
      100             0.25            10              0.7097821267
      150             0.1             10              0.7076660709
      150             0.15            10              0.7035191506
      150             0.2             10              0.704869857
      150             0.25            10              0.7097177331
      200             0.1             10              0.7077100207
      → Best Parameters (lowest MAE in the table: 150 latent factors, regularization 0.15, 10 iterations)
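A grid search simply tries every parameter combination and keeps the one with the lowest error. The sketch below is an illustration rather than the project's code: `grid_search` and `reported_mae` are hypothetical names, and instead of training a model the scoring function looks up the MAE values reported on slide 27.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate score_fn on every combination and return the lowest-scoring one."""
    names = list(param_grid)
    best_params, best_score = None, float("inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(**params)
        if score < best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Stand-in for "train ALS and compute MAE": the values reported on slide 27.
REPORTED_MAE = {
    (50, 0.1): 0.7106074619, (50, 0.15): 0.7042538296,
    (50, 0.2): 0.7041747462, (50, 0.25): 0.7087336317,
    (100, 0.1): 0.7076845943, (100, 0.15): 0.7036464454,
    (100, 0.2): 0.7048896137, (100, 0.25): 0.7097821267,
    (150, 0.1): 0.7076660709, (150, 0.15): 0.7035191506,
    (150, 0.2): 0.704869857, (150, 0.25): 0.7097177331,
}


def reported_mae(rank, reg):
    return REPORTED_MAE[(rank, reg)]


best_params, best_mae = grid_search(
    {"rank": [50, 100, 150], "reg": [0.1, 0.15, 0.2, 0.25]}, reported_mae)
print(best_params, best_mae)
```

In a real run, `score_fn` would train the model with the given parameters and return its MAE on the test split; the search loop itself stays the same.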
  • 28. Optimization: Evaluation [Diagram labels: User History, Content-Based Model, Similar Tracks, Expected Tracks, Collaborative Filtering, User Recs, Compare]
  • 29. Optimization: Evaluation Result ● Using 2019 data as train → 10M ● Using 2020 data as test → 1M ● Running on the Old Result: ○ Total Users: 116526 ○ Mean Score: 0.01% ○ Max Score: 40% ● Using all data → 70M ● Add date coefficients ● Finding Best Parameters ● Running on the New Result: ○ Total Users: 577457 ○ Mean Score: 1.1% ○ Max Score: 50% ● x110 improvement in mean score (0.01% → 1.1%) based on new data
  • 30. Recommender Systems ● Content Based ● Collaborative Filtering ○ Memory Based ○ Model Based
  • 31. EDA: Tracks Collection
      TRACK_ID  TIME_CREATED    PRICE  PUBLISH_DATE    ALBUM_ID  duration  TYPE_KEY_CURATION  TYPE_KEY_GENRE  TAG_ID_379950851  TAG_ID_379950852
      6034074   12/21/13 18:12  9990   10/23/13 16:10  6032170   232       0                  1               0                 1
      6036881   12/22/13 15:12  1990   10/23/13 16:10  6012439   249       0                  1               0                 0
      6037213   12/22/13 17:12  0      10/23/13 16:10  2828262   192       1                  1               0                 0
      6049227   12/25/13 15:12  0      3/21/13 16:03   6048970   203       0                  1               0                 0
      6059662   12/28/13 15:12  8990   1/1/12 1:01     6059612   549       0                  1               1                 0
  • 32. Model: Content-Based Filtering ● Item profile: for each track, construct a vector based on its features, such as its tags and artists ● User profile: for each user, build a vector that represents their interests, based on ratings or on likes and downloads
  • 33. Model: Content-Based Filtering/Item Profile ● Item vector: Feature 1 = 0, Feature 2 = 1, Feature 3 = 1, Feature 4 = 0, Feature 5 = 0
  • 34. Model: Content-Based Filtering/User profile ● User has rated items with profiles i1, i2, i3, ..., in ● One approach is a weighted average of the rated item profiles
  • 35. Model: Content-Based Filtering/User profile ● Items are songs, and the only feature is "tag" ● Item profile: vector with 0 or 1 for each tag ● Suppose user x has downloaded or liked 5 songs ● 2 songs featuring TAG A ● 3 songs featuring TAG B ● User profile = mean of item profiles ● Feature A's weight = 2/5 = 0.4 ● Feature B's weight = 3/5 = 0.6
  • 36. Model: Content-Based Filtering/User profile ● User vector: Tag A = 0.4, Tag B = 0.6, Tag C = 0, Tag D = 0, ...
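The worked example on slides 35 and 36 (5 songs, 2 with Tag A and 3 with Tag B) can be reproduced with a short Python sketch. The function name and the four-tag vocabulary are illustrative assumptions; the averaging rule is the one stated on the slide.

```python
def user_profile(item_profiles):
    """Mean of the item profile vectors of the tracks a user liked or downloaded."""
    n_items = len(item_profiles)
    n_features = len(item_profiles[0])
    return [sum(profile[d] for profile in item_profiles) / n_items
            for d in range(n_features)]


TAGS = ["Tag A", "Tag B", "Tag C", "Tag D"]

# One 0/1 vector per song: 2 songs with Tag A, 3 songs with Tag B.
songs = [
    [1, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
    [0, 1, 0, 0],
]

profile = user_profile(songs)
print(dict(zip(TAGS, profile)))
```

A weighted average, as mentioned on slide 34, would multiply each song's vector by a per-song weight (for example its impression score) before summing and divide by the total weight instead of the song count.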
  • 37. Model: Content-Based Filtering/Cosine similarity ● Estimate U(x, i) = cos(θ) = (x · i) / (|x| |i|), where x is the user profile vector, i is the item profile vector, and θ is the angle between them
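The cosine similarity estimate can be sketched in pure Python as follows. The function name and the sample vectors are illustrative; returning 0.0 for an all-zero vector is a convention chosen here, not something the slide specifies.

```python
import math


def cosine_similarity(x, i):
    """cos(theta) = (x . i) / (|x| |i|) for two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, i))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_i = math.sqrt(sum(b * b for b in i))
    if norm_x == 0 or norm_i == 0:
        return 0.0  # assumption: an empty profile matches nothing
    return dot / (norm_x * norm_i)


# User vector from slide 36 and two hypothetical tracks.
user = [0.4, 0.6, 0.0, 0.0]
track_with_tag_b = [0, 1, 0, 0]
track_with_tag_c = [0, 0, 1, 0]

print(cosine_similarity(user, track_with_tag_b))
print(cosine_similarity(user, track_with_tag_c))
```

Ranking all candidate tracks by this score and taking the top entries gives the content-based recommendation list for the user.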
  • 38. Pros ● No need for data on OTHER users ● Able to recommend to users with unique tastes ● Able to recommend new & unpopular items ○ No first-rater problem ● Explanations for recommended items
  • 39. Cons ● Finding the appropriate features is hard ● Overspecialization ○ Never recommends items outside the user's content profile
  • 40. Evaluation: Sample Track ● Homayoun Shajarian & Alireza Ghorbani, "Afsane Chashmhayat" ● Genre: Persian Traditional Music
  • 41. Evaluation: Similar Tracks Found by Content-Based Model ● Homayoun Shajarian - "Jana Be Negahi" ● Gholam-Hossein Banan - "Shart e Rah" (Ballad) ● Mohammad-Reza Shajarian - "Ah Baran" ● Alireza Eftekhari - "Asiri" (Ballad) ● Gholam-Hossein Banan - "Meykade Arezoo" (Ballad) ● Salar Aghili - "Daghe Jodaei" ● Homayoun Shajarian - "Be Tamashaye Negahat" ● Genre: Persian Traditional Music & Ballad + Same Artists in Some Items
  • 42. Spark Job
      #!/bin/bash
      HOME=/home/rc12g2
      HADOOP_HOME=/user/rc12g2
      source $HOME/.bashrc
      echo "spark job -> started"
      # Clear the previous aggregation output, then run the Spark job
      hadoop fs -rm -r -f $HADOOP_HOME/final-actions-v3
      spark-submit $HOME/beeptunes_recsys/collaborative-filtering/final-action-aggregator.py > $HOME/log/spark.log
      # Merge the HDFS part files into one local CSV
      hadoop fs -getmerge $HADOOP_HOME/user_recs_v3 $HOME/result/user_recs_v3.csv
      hadoop fs -rm -r -f $HADOOP_HOME/user_recs_v3
      # Insert the header row that mongoimport --headerline expects
      sed -i '1i USER_ID,RECOMMENDATION_IDS' $HOME/result/user_recs_v3.csv
      mongoimport --port 28018 -u 4max -p HFmk87Q3DgfEKKgC --type csv -d g2recsys -c user_rec --headerline --drop $HOME/result/user_recs_v3.csv
      echo "spark run -> complete"
  • 43. Track Collection Update Job
      #!/bin/bash
      HOME=/home/rc12g2
      echo "update_track_collection job -> started"
      cd $HOME/beeptunes_recsys/jobs
      mongoexport --port 28018 -u 4max -p HFmk87Q3DgfEKKgC --type csv -d g2recsys -c tracks --fieldFile tracks_fields.txt --out ./export.csv > $HOME/log/update_track_collection.log
      python update_trackCollection.py
      mongo --port 28018 -u 4max -p HFmk87Q3DgfEKKgC g2recsys --eval 'db.tracks.drop()'
      rm $HOME/beeptunes_recsys/jobs/export.csv
      echo "update_track_collection job -> ended"
  • 44. Crontab
      rc12g2@BIHADOOP-Master1:~$ crontab -l
      # m h dom mon dow command
      0 2 1 * * /home/beeptunes_recsys/jobs/trackUpdate_job.sh
      0 2 1 * * /home/beeptunes_recsys/jobs/spark_job.sh
  • 46. Hybrid Model ● Endpoint: /user/<user_id>/recommend [Diagram: a USER_ID enters the recommender system, which branches on "User Exists" / "User Doesn't Exist" and, for existing users, on "<16* actions" / ">=16* actions", routing the request to Content-Based, Collaborative Filtering, or the Trends** Collection.] ● *25% of users have fewer than 16 actions ● **Top 30 tracks with the most actions in the last 3 months
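The routing in the hybrid model can be sketched as a small Python function. The flattened slide text does not show which branch leads to which model, so the mapping below is an assumption: unknown users get the trends collection, users with fewer than 16 actions get content-based recommendations, and the rest get collaborative filtering. The function and constant names are hypothetical; only the threshold of 16 comes from the slide.

```python
MIN_ACTIONS_FOR_CF = 16  # threshold from slide 46


def choose_recommender(user_exists: bool, action_count: int) -> str:
    """Pick a recommendation source for one request (assumed mapping, see above)."""
    if not user_exists:
        return "trends"  # top tracks of the last 3 months
    if action_count < MIN_ACTIONS_FOR_CF:
        return "content_based"  # assumption: too little history for CF
    return "collaborative_filtering"


for exists, actions in [(False, 0), (True, 5), (True, 16), (True, 250)]:
    print(exists, actions, choose_recommender(exists, actions))
```

Keeping the decision in one function makes the fallback order explicit and lets the threshold be tuned in a single place.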