Ad Click Prediction: A Data-Intensive Problem
Arzam M. Kotriwala, Mazen Aly
Implications of Predicting Ad Clicks
● Boost revenues
● Huge online ad industry
● Relevance is key
● Revenue prediction
Big Data Challenges
● Machine learning models
● Insufficient memory
● Data merging
● Slow processing
Research Question
How to handle big data for machine learning using limited memory?
Dataset
Ads on Avito
Challenge: Data Merging
Solution: Data Merging
● Created database indexes on the columns used to join the tables
● Python script to process, join, and write the data to file in chunks
● Result: a single merged file (50 GB)
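
The merging step above can be sketched as follows. This is a minimal illustration, not the authors' script: it assumes a SQLite database, and the table names (`search_stream`, `search_info`) and column names are hypothetical placeholders. The idea is to index the join columns, run the join once, and stream the result to disk with `fetchmany` so that only one chunk of rows is held in memory at a time.

```python
import csv
import sqlite3


def export_join_in_chunks(conn, out_path, chunk_size=100_000):
    """Join two tables on an indexed key and stream the result to a CSV file.

    Table and column names are hypothetical; adapt them to the real schema.
    Returns the number of data rows written.
    """
    # Index the join columns so the join can look rows up instead of scanning.
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_stream_search ON search_stream(search_id)"
    )
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_info_search ON search_info(search_id)"
    )
    cursor = conn.execute(
        "SELECT s.search_id, s.ad_id, s.is_click, i.user_id, i.search_date "
        "FROM search_stream AS s "
        "JOIN search_info AS i ON s.search_id = i.search_id"
    )
    rows_written = 0
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        # Header row taken from the cursor's column metadata.
        writer.writerow([col[0] for col in cursor.description])
        while True:
            chunk = cursor.fetchmany(chunk_size)  # at most chunk_size rows
            if not chunk:
                break
            writer.writerows(chunk)
            rows_written += len(chunk)
    return rows_written
```

Memory use is bounded by `chunk_size` rather than by the size of the joined result, which is what makes a 50 GB output feasible on a machine with limited RAM.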
How to process the merged file?
● Out-of-core learning
● Representative sampling
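
To make the out-of-core option concrete, here is a minimal sketch in plain Python. The slides do not say which library or algorithm was used for this step, so everything here is an assumption for illustration: the file is read in fixed-size chunks, and a logistic regression model is updated by stochastic gradient descent one row at a time, so the full file never has to fit in memory. The function names, the learning rate, and the CSV layout are hypothetical.

```python
import csv
import math
from itertools import islice


def iter_chunks(path, chunk_size):
    """Yield lists of CSV rows (as dicts), one chunk in memory at a time."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        while True:
            chunk = list(islice(reader, chunk_size))
            if not chunk:
                return
            yield chunk


def sigmoid(z):
    # Clamp the logit to avoid overflow in math.exp.
    z = max(-35.0, min(35.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_out_of_core(path, feature_names, label_name,
                      chunk_size=10_000, lr=0.1, epochs=1):
    """Fit logistic regression with SGD while streaming the file in chunks."""
    weights = [0.0] * len(feature_names)
    bias = 0.0
    for _ in range(epochs):
        for chunk in iter_chunks(path, chunk_size):
            for row in chunk:
                x = [float(row[name]) for name in feature_names]
                y = float(row[label_name])
                p = sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))
                error = p - y  # gradient of the log loss w.r.t. the logit
                bias -= lr * error
                weights = [w - lr * error * xi for w, xi in zip(weights, x)]
    return weights, bias
```

Libraries such as scikit-learn offer incremental estimators for the same purpose; the hand-written loop is used here only to keep the example dependency-free.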
Smart Downsampling of Training Data
The training sample keeps:
● Any query for which at least one of the ads was clicked.
● A fraction r ∈ (0, 1] of the queries where none of the ads were clicked.
Fixing the sampling bias* reduced loss by 78%.
*Ad Click Prediction: a View from the Trenches (Google, 2013)
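
The cited Google paper corrects the bias introduced by this scheme with importance weights: examples from clicked queries get weight 1, and examples from sampled unclicked queries get weight 1/r, so the weighted sample matches the original data in expectation. The sketch below illustrates that idea under stated assumptions; the input layout (a query id plus a list of 0/1 click labels per ad) and the function name are hypothetical, not taken from the slides.

```python
import random


def downsample_queries(queries, r, seed=0):
    """Downsample unclicked queries and attach importance weights.

    queries: iterable of (query_id, labels), where labels is a list of
             0/1 click indicators, one per ad shown for that query.
    r:       fraction of unclicked queries to keep, in (0, 1].
    Returns a list of (query_id, labels, weight) tuples.
    """
    if not 0.0 < r <= 1.0:
        raise ValueError("r must be in (0, 1]")
    rng = random.Random(seed)
    sample = []
    for query_id, labels in queries:
        if any(labels):
            # At least one ad was clicked: always keep, weight 1.
            sample.append((query_id, labels, 1.0))
        elif rng.random() < r:
            # No clicks: keep with probability r, up-weight by 1/r.
            sample.append((query_id, labels, 1.0 / r))
    return sample
```

During training, each example's loss (and gradient) would be multiplied by its weight, which is how the up-weighting compensates for the dropped unclicked queries.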
How do you choose the sampling probability?
"Experiments have verified that even fairly aggressive sub-sampling of
unclicked queries has a very mild impact on accuracy, and that predictive
performance is not especially impacted by the specific value of r."*
*Ad Click Prediction: a View from the Trenches (Google, 2013)
Sampling results
● Number of context ads: 190,157,736 (50 GB)
● Number of sub-sampled ads: 5,766,142 (1.5 GB)
Feature Engineering
Feature              Description
Day_of_week          Day of the week extracted from the ad’s posted date
Hour                 Hour of day extracted from the ad’s posted date
Search_Ad_Ratio      Similarity between search query and ad title
User_click_prob      Historic probability of clicking an ad per user
Regular_ads_no       Number of regular ads per query
Context_ads_no       Number of context ads per query
Highlighted_ads_no   Number of highlighted ads per query
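
A few of these features can be sketched in a handful of lines. The slides do not specify how each feature was computed, so the choices below are assumptions for illustration: the timestamp format is a guess, `difflib.SequenceMatcher` stands in for whatever similarity measure was actually used for Search_Ad_Ratio, and User_click_prob is taken as clicks divided by impressions.

```python
from datetime import datetime
from difflib import SequenceMatcher


def time_features(timestamp, fmt="%Y-%m-%d %H:%M:%S"):
    """Extract Day_of_week (Monday = 0) and Hour from a timestamp string."""
    dt = datetime.strptime(timestamp, fmt)
    return {"Day_of_week": dt.weekday(), "Hour": dt.hour}


def search_ad_ratio(search_query, ad_title):
    """Case-insensitive string similarity in [0, 1] (assumed measure)."""
    return SequenceMatcher(None, search_query.lower(), ad_title.lower()).ratio()


def user_click_prob(clicks, impressions):
    """Historic click rate for a user; 0.0 when the user has no history."""
    return clicks / impressions if impressions else 0.0
```

The per-query counts (Regular_ads_no, Context_ads_no, Highlighted_ads_no) would be simple group-by counts over the merged data and are omitted here.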
Validation using Last 4 Days
Validation set size: 615,144 (11% of the sampled data)
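
A time-based split like this one can be sketched as below. The example assumes each row is a dict with a timestamp string under a hypothetical `search_date` key, and it interprets "last 4 days" as the last four calendar days present in the data; the slides do not state the exact cutoff rule.

```python
from datetime import datetime, timedelta


def split_last_days(rows, date_key="search_date", days=4,
                    fmt="%Y-%m-%d %H:%M:%S"):
    """Hold out rows from the last `days` calendar days as a validation set."""
    dates = [datetime.strptime(row[date_key], fmt).date() for row in rows]
    # First calendar day that belongs to the validation window.
    first_valid_day = max(dates) - timedelta(days=days - 1)
    train = [row for row, d in zip(rows, dates) if d < first_valid_day]
    valid = [row for row, d in zip(rows, dates) if d >= first_valid_day]
    return train, valid
```

Splitting by time rather than at random keeps the validation set closer to how the model is used: it is scored on searches that happen after everything it was trained on.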
Predictive Model
● Fit logistic regression model on training data
● Make predictions on validation set; evaluate locally using log loss: 0.0512
● Make predictions on test set; evaluate on Kaggle using log loss: 0.0588
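
For reference, the log loss used in both evaluations is the mean negative log-likelihood of the true labels under the predicted click probabilities. The sketch below is a plain-Python version under one assumption: predictions are clipped by a small `eps` (a hypothetical default of 1e-15) so that a prediction of exactly 0 or 1 does not produce an infinite loss.

```python
import math


def log_loss(y_true, y_pred, eps=1e-15):
    """Mean binary log loss; y_true holds 0/1 labels, y_pred probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)
```

Lower is better: a model that always predicts 0.5 scores about 0.693, so values near 0.05 reflect both the model and the very low base click rate.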
