Capstone Report
Kaggle Competition: Outbrain
Click Prediction
Vishal Changrani
Springboard Data Science Intensive
03/21/2017
Problem definition
• Outbrain Click Prediction competition hosted by Kaggle
• https://www.kaggle.com/c/outbrain-click-prediction
• Predict which of the displayed ads is most likely to be clicked
Evaluation metric
• Submissions were evaluated using the Mean Average Precision @12 (MAP@12):
$$\mathrm{MAP@12} = \frac{1}{|U|} \sum_{u=1}^{|U|} \sum_{k=1}^{\min(12,\, n)} P(k)$$
where |U| is the number of display_ids, P(k) is the precision at cutoff k and n is the number of
predicted ad_ids.
• The Kaggle leaderboard for this competition had a maximum MAP@12 of
0.70145.
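• MAP is the mean of the average precision (AP) over all queries (here, displays). The precision
at cutoff k is the fraction of correct items among the first k recommendations, and AP averages
that precision over the positions of the correct items. Since only one ad is clicked per display,
the AP of a display reduces to 1/rank of the clicked ad. Since the maximum number of ads in a
display is twelve, k = 12.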
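The metric can be sketched in a few lines of plain Python. This is a minimal illustration, not the competition's scoring code: the function names and the dictionary layout (display_id mapped to clicked ads and to a ranked list of predicted ads) are assumptions made for the example.

```python
def average_precision_at_k(actual, predicted, k=12):
    """AP@k for one display.

    actual:    set of clicked ad_ids (one element in this competition)
    predicted: list of ad_ids ranked from most to least likely
    """
    if not actual:
        return 0.0
    hits = 0
    score = 0.0
    seen = set()
    for rank, ad_id in enumerate(predicted[:k], start=1):
        if ad_id in actual and ad_id not in seen:
            hits += 1
            score += hits / rank  # precision at this cutoff
            seen.add(ad_id)
    return score / min(len(actual), k)


def map_at_k(actual_by_display, predicted_by_display, k=12):
    """Mean of AP@k over all display_ids."""
    scores = [
        average_precision_at_k(actual_by_display[d], predicted_by_display[d], k)
        for d in actual_by_display
    ]
    return sum(scores) / len(scores)


# Hypothetical toy data: two displays with one clicked ad each.
actual = {1: {"a"}, 2: {"x"}}
predicted = {1: ["a", "b", "c"], 2: ["y", "x", "z"]}
print(map_at_k(actual, predicted))  # (1/1 + 1/2) / 2 = 0.75
```

In the toy data the clicked ad is ranked first in one display and second in the other, so the two AP values are 1 and 1/2.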
Structure of data
• Clicks_train was the training data for which the dependent variable – ‘clicked’ was provided.
• Clicks_test was the test data for which the clicked probability needed to be predicted.
• Training and test data were provided for a 15-day window, 14th June to 28th June 2016.
• There was meta data associated with both the source document on which the promoted content
is shown and the target document to which the promoted content points.
• The meta data included the possible topics, categories and entities the document may belong to.
One document could belong to more than one topic, more than one category and more than one
entity. The relationship was quantified using a measure called ‘confidence_level’ which ranged
from 0 to 1 depending on how confident Outbrain was in classifying the document with that
topic, category or entity.
• User attributes such as platform (PC, smartphone or tablet), location (region and country), UUID
and time of access were also provided.
Data Cleansing
• Little data cleansing was needed
• Simple cleansing, such as making the user platform id coherent (1, 2 or 3) and dropping 5
entries with a missing platform id, was performed on the events dataframe.
Data Exploration: Number of ads
• Between 2 and 12 ads were shown each time
• Displays with 4 or 6 ads were at least three times as common as displays with any
other number of ads
Data Exploration: Click through rate (CTR)
• The CTR distribution was right-skewed, with a mean of 14%.
• The low CTR can be attributed to the fact that, for each display, only one of the ads
shown was clicked.
• 53% of ads were clicked at least once; in other words, 47% of ads were never clicked. Hence
any ad had an approximately 50-50 chance of being clicked at least once.
Data Exploration: Ad repetitions
• Each ad was repeated 182 times on average.
• The number of repetitions per ad ranged from 1 to 211824 (inclusive).
• The median number of times an ad was repeated was 5.
• The large disparity between the mean and the median number of repetitions indicated that the ad
selection logic may heavily favor some ads over others.
• The correlation between the number of times an ad was repeated and its CTR was negligible
(0.02), indicating that ad repetition didn’t affect its probability of being clicked one way or the
other. For ads that were clicked at least once, the correlation was negative but
negligible as well (-0.04).
Data Exploration: Clicks distribution (daily)
• The number of ads clicked daily remained fairly constant throughout the 15-day
period.
• Predictions were needed for the complete 15-day period; however, no training data was
available for the last two days, 27th and 28th June.
Data Exploration: Clicks distribution (hourly)
• Activity picked up after 8:00 AM and lasted until around 5:00 PM, the regular
business hours, and then picked up again from around 6:00 PM to 11:00 PM
Data Exploration: Source of user traffic
• Most of the user traffic originated in the US
• Within the US, user requests mainly came from California
Data Exploration: Clicks distribution (hourly US only)
User traffic was higher during normal business hours and during the evenings
(Distribution of ads during the day for the US only. All times have been normalized to EST, so that 8:00 AM in NY is also
8:00 AM in Atlanta and 8:00 AM in California, by subtracting 2 hours, 3 hours, etc. as appropriate)
Data Exploration: Source of traffic by platform
Most of the user traffic originated from mobile phones
(1 – PC, 2 – Mobile, 3 – tablet)
Data Exploration: Topics, Categories and Entities
• There are 300 unique topics, 97 unique categories and 1,326,009 unique entities that the
source and the target documents belong to.
• Not all documents have an associated topic or category or entity
• Distribution of topics, categories and entities was sparse.
Number of topics that appear more than 1% times: 19
Number of topics that appear more than 5% times: 1
Number of topics that appear more than 10% times: 0
Number of topics that appear more than 20% times: 0
Number of topics that appear more than 50% times: 0
Number of topics that appear more than 100% times: 0
Number of categories that appear more than 1% times: 28
Number of categories that appear more than 5% times: 4
Number of categories that appear more than 10% times: 1
Number of categories that appear more than 20% times: 0
Number of categories that appear more than 50% times: 0
Number of categories that appear more than 100% times: 0
Number of entities that appear more than 0.5% times: 6
Number of entities that appear more than 1% times: 3
Number of entities that appear more than 5% times: 1
Number of entities that appear more than 10% times: 0
Number of entities that appear more than 20% times: 0
Number of entities that appear more than 50% times: 0
Number of entities that appear more than 100% times: 0
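Counts like the ones above can be produced with a small helper. The sketch below is only an illustration: it assumes "appears more than x% times" means that a label's share of all rows exceeds x, and the function name and toy column are hypothetical.

```python
from collections import Counter


def count_above_share(labels, thresholds=(0.01, 0.05, 0.10)):
    """For each threshold, count the distinct labels whose share of
    all rows is strictly greater than that threshold."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        t: sum(1 for c in counts.values() if c / total > t)
        for t in thresholds
    }


# Hypothetical topic_id column with 100 rows and four distinct topics.
topic_ids = [1] * 60 + [2] * 30 + [3] * 6 + [4] * 4
result = count_above_share(topic_ids, thresholds=(0.05, 0.10, 0.50))
for threshold, n in result.items():
    print(f"Number of topics that appear more than {threshold:.0%} times: {n}")
```

A long tail of rare labels shows up as counts that drop to zero quickly as the threshold grows, which is the sparsity pattern described above.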
Data Exploration: Data Volume
• The sheer amount of data for this problem was overwhelming.
• I kept running into out-of-memory errors during data exploration and model evaluation.
• Hence I selectively chose only part of the data for further exploration and model building.
# File sizes
page_views_sample.csv 454.35MB
documents_meta.csv 89.38MB
documents_categories.csv 118.02MB
events.csv 1208.55MB
clicks_test.csv 506.95MB
promoted_content.csv 13.89MB
documents_topics.csv 339.47MB
documents_entities.csv 324.1MB
sample_submission.csv 273.14MB
clicks_train.csv 1486.73MB
Feature Selection: Topic, Category and Entity
• Topics, categories and entities are nominal data, hence I added dummy variables to the test
and training data for each topic, category and entity.
• To keep the resulting feature set manageable, I assigned each document to the single
topic, category and entity for which it had the maximum confidence level and dropped its other
topic, category and entity associations.
• I considered only the five most frequent categories, topics and entities for model building.
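The dummy-variable preparation (keep the highest-confidence label per document, then one-hot encode only the most frequent labels) can be sketched as follows. This is a plain-Python illustration under assumed inputs; the function names, the row layout (document_id, label, confidence_level) and the string labels are hypothetical, and the original notebook may have done this differently.

```python
from collections import Counter


def top_label_per_document(rows):
    """rows: iterable of (document_id, label, confidence_level).
    Keep, for each document, only the label with the highest confidence."""
    best = {}
    for doc_id, label, conf in rows:
        if doc_id not in best or conf > best[doc_id][1]:
            best[doc_id] = (label, conf)
    return {doc_id: label for doc_id, (label, _) in best.items()}


def one_hot_top_n(doc_labels, n=5):
    """Dummy variables for the n most frequent labels. Documents whose
    label is not among the top n get an all-zero row."""
    top = [label for label, _ in Counter(doc_labels.values()).most_common(n)]
    dummies = {
        doc_id: [1 if label == t else 0 for t in top]
        for doc_id, label in doc_labels.items()
    }
    return top, dummies


# Hypothetical category rows for three documents.
rows = [
    (100, "sports", 0.9), (100, "news", 0.2),
    (101, "news", 0.7),
    (102, "news", 0.6), (102, "tech", 0.5),
]
doc_labels = top_label_per_document(rows)
top, dummies = one_hot_top_n(doc_labels, n=2)
print(top)      # ['news', 'sports']
print(dummies)  # {100: [0, 1], 101: [1, 0], 102: [1, 0]}
```

Restricting the encoding to the top n labels keeps the number of dummy columns fixed, at the cost of treating all rarer labels as "none of the above".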
• To reduce multicollinearity I performed a Chi-Squared Test of Independence between the topics,
categories and entities, under the assumption that documents of the same category may
have the same topics and the same entities.
• The null hypotheses were,
• There is no association between topics and categories
• There is no association between categories and entities
• And the alternative hypotheses were,
• Topics and categories are associated (related) with each other
• Categories and entities are associated (related) with each other
• The p-value for both chi-square tests (topics vs. categories, and categories vs. entities) came
out to be 0.
• Thus there was indeed a relationship between the topics and the categories, and between the
categories and the entities, suggesting that all three were related. Hence I chose only categories
as features.
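As an illustration of the chi-squared test of independence, the statistic and its degrees of freedom can be computed directly from paired observations. The sketch below uses plain Python and made-up (topic, category) pairs; the function name and data are hypothetical. In practice a library routine such as scipy.stats.chi2_contingency would also return the p-value.

```python
from collections import Counter


def chi_square_independence(pairs):
    """pairs: iterable of (a, b) observations, e.g. one (topic, category)
    pair per document. Returns (chi-squared statistic, degrees of freedom)."""
    pairs = list(pairs)
    n = len(pairs)
    observed = Counter(pairs)
    row_totals = Counter(a for a, _ in pairs)
    col_totals = Counter(b for _, b in pairs)
    chi2 = 0.0
    for a, row_total in row_totals.items():
        for b, col_total in col_totals.items():
            # Expected count if a and b were independent
            expected = row_total * col_total / n
            chi2 += (observed[(a, b)] - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof


# Hypothetical data: topic t1 mostly co-occurs with c1, t2 with c2.
pairs = (
    [("t1", "c1")] * 30 + [("t1", "c2")] * 10
    + [("t2", "c1")] * 10 + [("t2", "c2")] * 30
)
chi2, dof = chi_square_independence(pairs)
print(chi2, dof)  # 20.0 1
```

With one degree of freedom the 5% critical value is about 3.84, so a statistic of 20 would reject the null hypothesis of independence for this toy table.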
Feature Selection: Display size & Reg. CTR
Display Size
• The likelihood of any ad being clicked depends on the total number of ads it is shown with
in a display.
• For example, an ad shown in a display along with one other ad is more likely to be clicked than
an ad shown with five other ads.
• Hence I chose the size of the display as a feature for the model.
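The display-size feature is simply the number of ads sharing a display_id. A minimal sketch, assuming the click rows are available as (display_id, ad_id) tuples; the function name and toy rows are hypothetical:

```python
from collections import Counter


def add_display_size(clicks):
    """clicks: list of (display_id, ad_id) rows. Returns the rows extended
    with the number of ads shown in the same display."""
    sizes = Counter(display_id for display_id, _ in clicks)
    return [(d, ad, sizes[d]) for d, ad in clicks]


# Hypothetical rows: display 1 shows two ads, display 2 shows four.
clicks = [(1, 10), (1, 11), (2, 10), (2, 12), (2, 13), (2, 14)]
rows = add_display_size(clicks)
print(rows[0])   # (1, 10, 2)
print(rows[-1])  # (2, 14, 4)
```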
Regularized CTR
• CTR represents the popularity of an ad.
• However, the CTR of an ad that was shown only a few times can be misleadingly high compared to
that of an ad that was shown many times.
• E.g., an ad that was shown four times and clicked three times has a CTR of 0.75, while an ad
that was shown a thousand times and clicked six hundred times has a CTR of 0.6.
• Hence I regularized the CTR with a regularization factor of 10, using the formula below,
reg_ctr = number of times an ad was clicked / (number of times an ad was shown + regularization factor)
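A short worked example shows the effect of the regularization factor on the two ads from the example. The function name is hypothetical; the formula is the one given above.

```python
def regularized_ctr(clicks, views, reg=10):
    """reg_ctr = clicks / (views + regularization factor)."""
    return clicks / (views + reg)


# Ad shown 4 times, clicked 3 times: raw CTR 3/4 = 0.75
small_ad = regularized_ctr(3, 4)        # 3 / 14, about 0.214
# Ad shown 1000 times, clicked 600 times: raw CTR 0.6
large_ad = regularized_ctr(600, 1000)   # 600 / 1010, about 0.594

print(round(small_ad, 3), round(large_ad, 3))  # 0.214 0.594
```

The rarely shown ad no longer outranks the frequently shown one, while the CTR of the frequently shown ad barely changes.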
Classifier Selection: Random Forest
I chose Random Forest as the classifier because,
• It is based on decision trees and is easy to comprehend.
• It provides feature importances, and hence feature selection, for free.
• It is an ensemble method that builds several decision trees and then combines their
predictions. Hence I expected it to generalize well rather than overfit.
• It has built-in support for validation through out-of-bag estimates.
• Not all features had a linear relationship with the click probability, hence logistic
regression may not have been a good fit.
Hyper-parameter tuning
• I tried to choose the Random Forest hyper-parameters, such as min_sample_size and the
number of estimators, using Grid Search. However, I could not obtain optimal values for all of
them this way, since the search would run seemingly forever even on a 16-core Azure VM. Hence I
mostly had to choose parameters such that the classifier ran in a modest amount of time.
The final parameters I passed to Random Forest were,
• criterion: gini (derived via grid search)
• n_estimators: 3
• max_features: n_features (all features)
• n_jobs: 4 (utilizing 4 cores)
• min_samples_leaf: 0.10 (I added pruning only due to hardware limitations. Without pruning the
depth of the trees would be large and the classifier wouldn’t finish running)
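Written as keyword arguments, the parameters above might look like the dictionary below. This is a sketch, not the notebook's actual call: it assumes scikit-learn's RandomForestClassifier, whose parameter names these match, and the variable names are hypothetical.

```python
# Keyword arguments in the form accepted by
# sklearn.ensemble.RandomForestClassifier(**rf_params) (assumed classifier).
rf_params = {
    "criterion": "gini",
    "n_estimators": 3,
    "max_features": None,      # None means all features are considered per split
    "n_jobs": 4,               # use 4 cores
    "min_samples_leaf": 0.10,  # a float is read as a fraction of the samples
}

# Every leaf must hold at least 10% of the training samples,
# so a single tree can have at most 10 leaves.
max_leaves_per_tree = int(1 / rf_params["min_samples_leaf"])
print(max_leaves_per_tree)  # 10
```

With so few leaves available, each tree can make only a handful of splits, which limits how many features can contribute to the model.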
Conclusion
• I created the model using the Random Forest classifier and fitted it on the training data with
the features display_size, reg_ctr, platform, and the source-document and target-document
dummy variables for the five most frequent categories: thirteen features in all.
• I got a MAP@12 (MAPK) score of 0.61415 by using the model on the test data.
• Strangely, though, Random Forest reported reg_ctr as the only feature with an importance of 1;
all other features had an importance of 0. This may be a side effect of the heavy pruning: with
min_samples_leaf set to 0.10 and only three trees, each tree can make very few splits.
• The high importance of reg_ctr indicates that CTR is one of the most important features in
ad click prediction.
• The sparsity of categories, topics and entities may make them poor features.
• This was my first attempt at a data science problem and I realized that finding good features is as
difficult as tuning the model.
• I learnt how to approach a data science problem, how to explore data and make sense of it,
and how to tune a model.
• Finally, I also learnt how fun, interesting and rewarding data science can be, while demanding
a lot of perseverance at the same time.
Thank you
• Special thanks to my mentor – Ankit Jain!
References
• http://hamelg.blogspot.com/2015/11/python-for-data-analysis-part-25-chi.html
• https://www.kaggle.com/c/outbrain-click-prediction/discussion
• https://makarandtapaswi.wordpress.com/2012/07/02/intuition-behind-average-precision-and-map/
• http://fastml.com/what-you-wanted-to-know-about-mean-average-precision/
• https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/
Appendix A
Python notebook available at:
• https://github.com/vishalchangrani/outbrain/blob/master/OutbrainFinalSolution.ipynb
• https://github.com/vishalchangrani/outbrain/blob/master/OutbrainInitial.ipynb
• https://github.com/vishalchangrani/outbrain/blob/master/FeatureAnalysis.ipynb
Capstone Report :
• https://github.com/vishalchangrani/outbrain/blob/master/finalreport/vishalchangrani_Kaggle_outbrain_report.pdf

 
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
VIP Call Girls Service Miyapur Hyderabad Call +91-8250192130
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 

Kaggle Capstone Report on Outbrain Click Prediction Competition

  • 1. Capstone Report Kaggle Competition: Outbrain Click Prediction Vishal Changrani Springboard Data Science Intensive 03/21/2017
  • 2. Problem definition • Outbrain Click Prediction competition hosted by Kaggle • https://www.kaggle.com/c/outbrain-click-prediction • Predict which of the displayed ads is most likely to be clicked
  • 3. Evaluation metric • Submissions were evaluated using the Mean Average Precision @12 (MAP@12): $\mathrm{MAP@12} = \frac{1}{|U|} \sum_{u=1}^{|U|} \sum_{k=1}^{\min(12, n)} P(k)$, where |U| is the number of display_ids, P(k) is the precision at cutoff k and n is the number of predicted ad_ids. • The Kaggle leaderboard for this competition had a maximum MAP@12 of 0.70145. • MAP is the mean, over all queries, of the average precision (AP) of each query. Precision at cutoff k is the fraction of correct items among the first k recommendations, and the AP of a query averages that precision over the positions at which a correct item appears. Since the maximum number of ads per display is twelve, k = 12.
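The metric on slide 3 can be sketched in a few lines of plain Python. This is a minimal illustration of the usual MAP@k definition, not the competition's official scorer; the function names `average_precision_at_k` and `map_at_k` are hypothetical, and the inputs (a set of clicked ad_ids and a ranked list of predicted ad_ids per display) are assumptions about how the data would be arranged.

```python
def average_precision_at_k(clicked, ranked_ads, k=12):
    """AP@k for one display.

    clicked    -- set of ad_ids that were actually clicked (assumed input shape)
    ranked_ads -- predicted ad_ids, most likely first
    """
    if not clicked:
        return 0.0
    hits = 0
    score = 0.0
    for rank, ad_id in enumerate(ranked_ads[:k], start=1):
        # Count each correct ad once, at the first position where it appears.
        if ad_id in clicked and ad_id not in ranked_ads[:rank - 1]:
            hits += 1
            score += hits / rank  # precision at this cutoff
    return score / min(len(clicked), k)


def map_at_k(actual, predicted, k=12):
    """Mean of AP@k over all displays."""
    scores = [average_precision_at_k(a, p, k) for a, p in zip(actual, predicted)]
    return sum(scores) / len(scores)


# Two displays: the clicked ad is ranked 1st in one and 2nd in the other.
example = map_at_k([{1}, {5}], [[1, 2, 3], [4, 5, 6]])  # (1.0 + 0.5) / 2
```

When each display has exactly one clicked ad, the AP of a display reduces to 1 / (rank of the clicked ad), so the score rewards placing the clicked ad as close to the top as possible.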
  • 4. Structure of data • Clicks_train was the training data, for which the dependent variable 'clicked' was provided. • Clicks_test was the test data, for which the click probability needed to be predicted. • Training and test data were provided for a 15-day window: 14th June to 28th June 2016. • There was metadata associated with both the source document on which the promoted content is shown and the target document to which the promoted content points. • The metadata included the possible topics, categories and entities the document may belong to. One document could belong to more than one topic, more than one category and more than one entity. The relationship was quantified using a measure called 'confidence_level', which ranged from 0 to 1 depending on how confident Outbrain was in classifying the document with that topic, category or entity. • User attributes such as platform (PC, smartphone or tablet), location (region and country), UUID and time of access were also provided.
  • 5. Data Cleansing • Not much data cleansing was needed. • Simple cleansing was performed on the events dataframe: making the user platform id coherent (1, 2 or 3) and dropping 5 entries with a missing platform id.
  • 6. Data Exploration: Number of ads • Between 2 and 12 ads were shown in each display. • Displays with 4 or 6 ads were at least three times as frequent as displays with any other number of ads.
  • 7. Data Exploration: Click through rate (CTR) • The CTR distribution was right-skewed, with a mean of 14%. • The low CTR can be attributed to the fact that, for each display, only one of the ads shown was clicked. • 53% of ads were clicked at least once; in other words, 47% of ads were never clicked. Hence any ad had a roughly 50-50 chance of ever being clicked.
  • 8. Data Exploration: Ad repetitions • Each ad was shown 182 times on average. • The number of times an ad was shown ranged from 1 to 211,824 (inclusive). • The median number of times an ad was shown was 5. • The large disparity in the number of times an ad was shown suggested that the ad selection logic may be skewed toward some ads. • The correlation between the number of times an ad was shown and its CTR was negligible (0.02), indicating that repetition did not affect an ad's probability of being clicked one way or the other. Restricted to ads that were clicked at least once, the correlation was negative but also negligible (-0.04).
  • 9. Data Exploration: Clicks distribution (daily) • The number of ads clicked daily remained fairly constant throughout the 15-day period. • Predictions were needed for the complete 15-day period; however, no training data was available for the last two days (27th and 28th June).
  • 10. Data Exploration: Clicks distribution (hourly) • Activity picked up from 8:00 am until around 5:00 pm, which are regular business hours, and again from around 6:00 pm to 11:00 pm.
  • 11. Data Exploration: Source of user traffic • Most of the user traffic originated in the US. • Within the US, user requests mainly came from California.
  • 12. Data Exploration: Clicks distribution (hourly, US only) • User traffic was higher during normal business hours and during the evenings. • The hourly distribution covers the US only. All times were normalized relative to EST so that 8:00 AM in New York corresponds to 8:00 AM local time in Atlanta and in California, by subtracting 2 hours, 3 hours, etc. as appropriate.
  • 13. Data Exploration: Source of traffic by platform • Most of the user traffic originated from mobile phones (1 = PC, 2 = mobile, 3 = tablet).
  • 14. Data Exploration: Topics, Categories and Entities • There are 300 unique topics, 97 unique categories and 1,326,009 unique entities that the source and target documents belong to. • Not all documents have an associated topic, category or entity. • The distribution of topics, categories and entities was sparse. • Number of topics that appear more than 1% / 5% / 10% / 20% / 50% / 100% of the time: 19 / 1 / 0 / 0 / 0 / 0. • Number of categories that appear more than 1% / 5% / 10% / 20% / 50% / 100% of the time: 28 / 4 / 1 / 0 / 0 / 0. • Number of entities (labeled "categories" in the original output, presumably entities) that appear more than 0.5% / 1% / 5% / 10% / 20% / 50% / 100% of the time: 6 / 3 / 1 / 0 / 0 / 0 / 0.
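The sparsity counts on slide 14 can be reproduced with a small frequency tally. The sketch below is an assumption about how such counts might be computed, not the notebook's actual code; `count_above_share` is a hypothetical helper and the toy labels are made up for illustration.

```python
from collections import Counter


def count_above_share(labels, thresholds=(0.01, 0.05, 0.10, 0.20, 0.50)):
    """For each threshold, count the distinct labels whose share of all
    occurrences is strictly greater than that threshold."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        t: sum(1 for c in counts.values() if c / total > t)
        for t in thresholds
    }


# Toy data: 100 label occurrences spread over four labels.
toy_labels = ["a"] * 60 + ["b"] * 30 + ["c"] * 7 + ["d"] * 3
shares = count_above_share(toy_labels)
```

Applied to the topic, category or entity ids of all document associations, a result in which only a handful of labels exceed even 1% is what the slide describes as a sparse distribution.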
  • 15. Data Exploration: Data Volume • The sheer amount of data for this problem was overwhelming. • I kept running into out-of-memory errors during data exploration and model evaluation. • Hence I chose only part of the data for further exploration and model building. • File sizes: page_views_sample.csv 454.35 MB; documents_meta.csv 89.38 MB; documents_categories.csv 118.02 MB; events.csv 1208.55 MB; clicks_test.csv 506.95 MB; promoted_content.csv 13.89 MB; documents_topics.csv 339.47 MB; documents_entities.csv 324.1 MB; sample_submission.csv 273.14 MB; clicks_train.csv 1486.73 MB.
  • 16. Feature Selection: Topic, Category and Entity • Topics, categories and entities are nominal data, hence I added dummy variables to the test and training data for each topic, category and entity. • To deal with the large feature set this produced, I treated each document as belonging to the single topic, category and entity for which it had the maximum confidence level, and dropped its other topic, category and entity associations. • I considered only the five most frequent categories, topics and entities for model building.
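The reduction described on slide 16 (one label per document, then dummy variables for the five most frequent labels) might look like the following in plain Python. The row layout `(document_id, label_id, confidence_level)` and both function names are assumptions made for illustration; the original notebook may have done this differently.

```python
from collections import Counter


def strongest_label(rows):
    """Keep, for each document, only the label with the highest confidence.

    rows -- iterable of (document_id, label_id, confidence_level) tuples
    """
    best = {}
    for doc_id, label, confidence in rows:
        if doc_id not in best or confidence > best[doc_id][1]:
            best[doc_id] = (label, confidence)
    return {doc_id: label for doc_id, (label, _) in best.items()}


def top_n_dummies(doc_labels, n=5):
    """One-hot encode each document against the n most frequent labels.

    Documents whose label is outside the top n get an all-zero row.
    """
    top = [label for label, _ in Counter(doc_labels.values()).most_common(n)]
    dummies = {
        doc_id: [int(label == t) for t in top]
        for doc_id, label in doc_labels.items()
    }
    return top, dummies


rows = [
    (1, "sports", 0.9), (1, "news", 0.2),
    (2, "sports", 0.6),
    (3, "news", 0.8), (3, "tech", 0.1),
]
labels = strongest_label(rows)            # one label per document
top, dummies = top_n_dummies(labels, n=2)  # dummy columns for the top 2 labels
```

Capping the encoding at the most frequent labels keeps the number of dummy columns fixed, at the cost of discarding information about every document whose strongest label is rare.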
  • 17. Feature Selection: Topic, Category and Entity • To reduce multicollinearity I performed a Chi-Squared Test of Independence between the topics, categories and entities, under the assumption that documents of the same category may have the same topics and the same entities. • The null hypotheses were: • There is no association between topics and categories • There is no association between categories and entities • The alternate hypotheses were: • Topics and categories are associated (related) with each other • Categories and entities are associated (related) with each other • The p-value for both chi-square tests (topics vs. categories, and categories vs. entities) came out to be effectively 0. • Thus there was indeed a relationship between topics and categories and between categories and entities, suggesting that all three were related. • Hence I chose only categories as features.
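The test on slide 17 compares observed counts in a contingency table (for example, categories as rows and topics as columns) against the counts expected under independence. The sketch below computes only the chi-squared statistic and its degrees of freedom by hand; `chi_square_statistic` is a hypothetical helper, the toy table is invented, and the code assumes every row and column has at least one observation. Getting the p-value requires the chi-squared distribution, which `scipy.stats.chi2_contingency` provides if SciPy is available.

```python
def chi_square_statistic(table):
    """Pearson chi-squared statistic for a contingency table.

    table -- list of rows of observed counts; assumes no empty row or column
    Returns (statistic, degrees_of_freedom).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof


# Toy 2x2 table: rows and columns are clearly not independent.
stat, dof = chi_square_statistic([[10, 20], [20, 10]])
```

A large statistic relative to the degrees of freedom leads to a small p-value and to rejecting the null hypothesis of no association, which is the outcome the slide reports for both pairs of variables.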
  • 18. Feature Selection: Display size & Reg. CTR • Display Size • The likelihood of any ad being clicked depends on the total number of ads it is shown with in a display. • For example, an ad shown in a display along with one other ad is more likely to be clicked than an ad shown with five other ads. • Hence I chose the size of the display as a feature for the model. • Regularized CTR • CTR represents the popularity of the ad. • However, the CTR of ads that were shown only a few times may be high compared to that of ads that were shown many times. • For example, an ad that was shown four times and clicked three times has a CTR of 0.75, while an ad that was shown a thousand times and clicked six hundred times has a CTR of 0.6. • Hence I regularized CTR with a regularization factor of 10, using the formula: reg_ctr = number of times an ad was clicked / (number of times an ad was shown + regularization factor)
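Both features from slide 18 are simple to compute. The sketch below applies the slide's reg_ctr formula to its own two example ads and derives display sizes from a list of `(display_id, ad_id)` rows; the function names and the row layout are assumptions for illustration.

```python
from collections import Counter


def regularized_ctr(clicks, views, reg=10):
    """reg_ctr = clicks / (views + regularization factor)."""
    return clicks / (views + reg)


def display_sizes(rows):
    """Number of ads shown in each display.

    rows -- iterable of (display_id, ad_id) pairs (assumed layout)
    """
    return Counter(display_id for display_id, _ad_id in rows)


# The two ads from the slide: raw CTR 0.75 vs 0.6.
rare_ad = regularized_ctr(3, 4)          # 3 / 14, about 0.214
popular_ad = regularized_ctr(600, 1000)  # 600 / 1010, about 0.594

sizes = display_sizes([(1, 10), (1, 11), (1, 12), (2, 20), (2, 21)])
```

With the regularization factor in the denominator, the rarely shown ad drops well below the frequently shown one, which reverses the ordering given by the raw CTR.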
  • 19. Classifier Selection: Random Forest • I chose Random Forest as the classifier because: • It is based on decision trees and is easy to comprehend. • It provided feature selection (feature importances) for free. • It is an ensemble method which creates several decision trees and then combines their predictions. Hence I thought it would not overfit and would generalize well. • It has a built-in validation mechanism (out-of-bag estimates) comparable to cross-validation. • Not all features had a linear relationship with the click probability, hence logistic regression may not have been a good fit.
  • 20. Hyper-parameter tuning • I tried to choose the Random Forest hyper-parameters, such as min_sample_size and the number of estimators, using Grid Search, but could not obtain optimal values for all of them because the search would run forever even on a 16-core Azure VM. Hence I mostly had to choose parameters such that the classifier ran in a modest amount of time. • The final parameters I passed to Random Forest were: • criterion: gini (derived via grid search) • n_estimators: 3 • max_features: n_features (all features) • n_jobs: 4 (utilizing 4 cores) • min_samples_leaf: 0.10 (I added pruning only due to hardware limitations. Without pruning the depth of the tree would be large and the classifier wouldn't finish running.)
  • 21. Conclusion • I created the model using the Random Forest classifier and fitted it on the training data with the following features: display_size, reg_ctr, platform, and the source document and target document dummy variables for the five most frequent categories. Thirteen features in all. • I got a MAP@12 (MAPK) score of 0.61415 by using the model on the test data. • Strangely, Random Forest reported reg_ctr as the only feature with an importance of 1; all other features had an importance of 0. • The high importance of reg_ctr indicates that CTR is one of the most important features in ad click prediction. • The sparsity of categories, topics and entities may disqualify them as good features. • This was my first attempt at a data science problem and I realized that finding good features is as difficult as tuning the model. • I learnt how to approach a data science problem, how to explore data and make sense of it, and how to tune a model. • Finally, I also learnt how fun, interesting and rewarding data science can be, while demanding a lot of perseverance at the same time.
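To score a model with MAP@12, the per-ad click probabilities have to be turned into one ranked list of ads per display. The sketch below shows one way to do that; `rank_ads` is a hypothetical helper, and the input rows of `(display_id, ad_id, predicted_click_probability)` are an assumed layout, not the notebook's actual data structure.

```python
from collections import defaultdict


def rank_ads(rows):
    """Order the ads of each display by predicted click probability.

    rows -- iterable of (display_id, ad_id, predicted_click_probability)
    Returns {display_id: [ad_id, ...]} with the most likely ad first.
    """
    by_display = defaultdict(list)
    for display_id, ad_id, prob in rows:
        by_display[display_id].append((prob, ad_id))

    return {
        display_id: [
            ad_id
            for _prob, ad_id in sorted(pairs, key=lambda p: p[0], reverse=True)
        ]
        for display_id, pairs in by_display.items()
    }


ranked = rank_ads([
    (1, 10, 0.2), (1, 11, 0.7), (1, 12, 0.1),
    (2, 20, 0.5),
])
```

Only the order of ads within a display matters for the metric, so a feature such as reg_ctr can dominate the score as long as it ranks the clicked ad near the top.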
  • 22. Thank you • Special thanks to my mentor – Ankit Jain!
  • 23. References • http://hamelg.blogspot.com/2015/11/python-for-data-analysis-part-25-chi.html • https://www.kaggle.com/c/outbrain-click-prediction/discussion • https://makarandtapaswi.wordpress.com/2012/07/02/intuition-behind-average-precision-and-map/ • http://fastml.com/what-you-wanted-to-know-about-mean-average-precision/ • https://www.analyticsvidhya.com/blog/2016/04/complete-tutorial-tree-based-modeling-scratch-in-python/
  • 24. Appendix A Python notebook available at: • https://github.com/vishalchangrani/outbrain/blob/master/OutbrainFinalSolution.ipynb • https://github.com/vishalchangrani/outbrain/blob/master/OutbrainInitial.ipynb • https://github.com/vishalchangrani/outbrain/blob/master/FeatureAnalysis.ipynb Capstone Report : • https://github.com/vishalchangrani/outbrain/blob/master/finalreport/vishalchangrani_Kaggle_outbrain_report.pdf