SlideShare a Scribd company logo
1 of 31
© 2015 IBM Corporation
Predict whether an online ad will be
clicked using Spark MLlib pipelines
Presenter: Manisha Sule, IBM Spark Enablement
© 2015 IBM Corporation2
Outline
 Online advertising and click through rate prediction
 Data set and features
 Spark MLlib and the Pipeline API
 MLlib pipeline for Click Through Rate Prediction
 Questions
Online Advertising
© 2015 IBM Corporation4
Online advertising
© 2015 IBM Corporation5
Online advertising on search engines
© 2015 IBM Corporation6
Social Media advertising
 Facebook's "Sponsored Stories"
 LinkedIn's "Sponsored Updates"
 Twitter's "Promoted Tweets"
© 2015 IBM Corporation7
Online advertising is big business
© 2015 IBM Corporation8
Types of online advertising
 Retargeting – Using cookies, track if a user left a webpage without making a purchase and
retarget the user with ads from that site
 Behavioral targeting – Data related to user’s online activity is collected from multiple
websites, thus creating a detailed picture of the user’s interests to deliver more targeted
advertising
 Contextual advertising – Display ads related to the content of the webpage
 Geo targeting – Ads are presented based on the user’s suspected geography
© 2015 IBM Corporation9
Online Advertising – The Players
Publishers:
 Earn revenue by displaying ads on their sites
 Google, Wall Street Journal, Twitter
Advertisers:
 Pay for their ads to be displayed on publisher sites
 Goal is to increase business via advertising
Matchmakers:
 Match publishers with advertisers
 Interaction with advertisers and publishers occurs real time
© 2015 IBM Corporation10
Revenue models from online advertising
 CPM (Cost-Per-Mille): is an inventory based pricing model. Is when the price is based on
1,000 impressions.
 CPC (Cost-Per-Click): Is a performance-based metric. This means the Publisher only gets
paid when (and if) a user clicks on an ad
 CPA (Cost Per Action): Best deal of all for Advertisers in terms of risk because they only
pay for media when it results in a sale
© 2015 IBM Corporation11
Click-through rate (CTR)
 Ratio of users who click on an ad to the number of total users who view the ad
 Used to measure the success of an online advertising campaign
 Very first online ad shown for AT&T on website HotWired has a 44% Click-through rate
 Today, typical click through rate is less than 1%
 In a pay-per-click setting, revenue can be maximized by choosing to display ads that have
the maximum CTR, hence the problem of CTR Prediction.
© 2015 IBM Corporation12
Predict CTR – a Scalable Machine Learning success story
Predict conditional probability that the ad will be clicked by the user given the predictive
features of ads
Predictive features are:
 Ad’s historical performance
 Advertiser and ad content info
 Publisher info
 User Info (eg: search/ click history)
Data set is high dimensional, sparse and skewed:
 Hundreds of millions of online users
 Millions of unique publisher pages to display ads
 Millions of unique ads to display
 Very few ads get clicked by users
Dataset and features
© 2015 IBM Corporation14
About the dataset used
 Sample of the dataset used for the Display Advertising Challenge hosted by Kaggle:
https://www.kaggle.com/c/criteo-display-ad-challenge/
 Consists of a portion of Criteo's traffic over a period of 7 days.
© 2015 IBM Corporation15
Dataset Features
Data fields
 Label - Target variable that indicates if an ad was clicked (1) or not (0).
 I1-I13 - A total of 13 columns of integer features (mostly count features).
 C1-C26 - A total of 26 columns of categorical features. The values of these features have
been hashed onto 32 bits for anonymization purposes.
Spark MLlib and Pipeline API
© 2015 IBM Corporation17
Spark MLlib
 Spark’s machine learning (ML) library.
 Goal is to make ML scalable and easy
 Includes common learning algorithms and utilities, like classification, regression, clustering,
collaborative filtering, dimensionality reduction
 Includes lower-level optimization primitives and higher level pipeline APIs
 Divided into two packages
• Spark.mllib: original API built on top of RDDs
• Spark.ml: provides higher level API built on top of DataFrames for constructing ML
pipelines
© 2015 IBM Corporation18
ML pipelines for complex ML workflows
 DataFrame: Equivalent to a relational table in Spark SQL
 Transformer: An algorithm that transforms one DataFrame into another
 Estimator: An algorithm that can be fit on a DataFrame to produce a Transformer
 Pipeline: Chains multiple transformers and estimators to
produce a ML workflow
 Parameter: Common API for specifying parameters
MLlib Pipeline for Click Through Rate
Prediction
© 2015 IBM Corporation20
Overview of ML pipeline for Click Through Rate Prediction
© 2015 IBM Corporation21
Step 1 - Parse Data into Spark SQL DataFrames
 Spark SQL converts a RDD of Row objects into a DataFrame
 Rows are constructed by passing key/value pairs where keys define the column names of
the table
 The data types are inferred by looking at the data
© 2015 IBM Corporation22
Step 2 – Feature Transformer using StringIndexer
 Encodes a string column to a column of indices
 Indices are from 0 to max number of distinct string values, ordered by frequencies
 If column is numeric, cast it to string and index string values
© 2015 IBM Corporation23
Step 3 - Feature Transformer using One Hot Encoding
 Maps a column of label indices to a column of binary vectors
 Allows algorithms which expect continuous features, to use categorical features
© 2015 IBM Corporation24
Step 4 – Feature Selector using Vector Assembler
 Transformer that combines a given list of columns into a single vector column
 Useful for combining raw features and features generated by different feature transformers
into a single feature vector
© 2015 IBM Corporation25
Step 5 – Train a model using Estimator Logistic Regression
© 2015 IBM Corporation26
Apply the pipeline to make predictions
Demo
© 2015 IBM Corporation28
Parameter Tuning using CrossValidator or TrainValidationSplit
© 2015 IBM Corporation29
Some alternatives:
 Use Hashed features instead of OHE
 Use Log loss evaluation or ROC to evaluate Logistic Regression
 Perform feature selection
 Use Naïve Bayes or other binary classification algorithms
© 2015 IBM Corporation30
Thank you!
© 2015 IBM Corporation31
References:
 Wikipedia: https://en.wikipedia.org/wiki/Online_advertising
 BerkeleyX: CS190.1x Scalable Machine Learning (https://courses.edx.org/)
 Big Data University: Spark Fundamentals 1 (https://bigdatauniversity.com/courses/spark-
fundamentals/)
 Big Data University: Spark Fundamentals 2 (http://bigdatauniversity.com/courses/spark-
fundamentals-ii/)
 Learning Spark: Lightning-Fast Big Data Analysis (Book by Andy Konwinski, Holden Karau,
and Patrick Wendell)
 Advanced Analytics with Spark: Patterns for Learning from Data at Scale (Book by Josh
Wills, Sandy Ryza, Sean Owen, and Uri Laserson)
 Spark Programming Guide (http://spark.apache.org/docs/latest/programming-guide.html)

More Related Content

What's hot

Meetup: Streaming Data Pipeline Development
Meetup:  Streaming Data Pipeline DevelopmentMeetup:  Streaming Data Pipeline Development
Meetup: Streaming Data Pipeline DevelopmentTimothy Spann
 
Past, Present & Future of Recommender Systems: An Industry Perspective
Past, Present & Future of Recommender Systems: An Industry PerspectivePast, Present & Future of Recommender Systems: An Industry Perspective
Past, Present & Future of Recommender Systems: An Industry PerspectiveJustin Basilico
 
Lecture1 introduction to big data
Lecture1 introduction to big dataLecture1 introduction to big data
Lecture1 introduction to big datahktripathy
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector MachinesCloudxLab
 
Machine Learning + Graph Databases for Better Recommendations V1 08/06/2022
Machine Learning + Graph Databases for Better Recommendations V1 08/06/2022Machine Learning + Graph Databases for Better Recommendations V1 08/06/2022
Machine Learning + Graph Databases for Better Recommendations V1 08/06/2022ArangoDB Database
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache SparkRahul Jain
 
Improving Machine Learning using Graph Algorithms
Improving Machine Learning using Graph AlgorithmsImproving Machine Learning using Graph Algorithms
Improving Machine Learning using Graph AlgorithmsNeo4j
 
Deep Learning for Graphs
Deep Learning for GraphsDeep Learning for Graphs
Deep Learning for GraphsDeepLearningBlr
 
Recommendation at Netflix Scale
Recommendation at Netflix ScaleRecommendation at Netflix Scale
Recommendation at Netflix ScaleJustin Basilico
 
Time series forecasting with machine learning
Time series forecasting with machine learningTime series forecasting with machine learning
Time series forecasting with machine learningDr Wei Liu
 
Understanding your Data - Data Analytics Lifecycle and Machine Learning
Understanding your Data - Data Analytics Lifecycle and Machine LearningUnderstanding your Data - Data Analytics Lifecycle and Machine Learning
Understanding your Data - Data Analytics Lifecycle and Machine LearningAbzetdin Adamov
 
Calibrated Recommendations
Calibrated RecommendationsCalibrated Recommendations
Calibrated RecommendationsHarald Steck
 
GTC 2021: Counterfactual Learning to Rank in E-commerce
GTC 2021: Counterfactual Learning to Rank in E-commerceGTC 2021: Counterfactual Learning to Rank in E-commerce
GTC 2021: Counterfactual Learning to Rank in E-commerceGrubhubTech
 
Machine Learning at Netflix Scale
Machine Learning at Netflix ScaleMachine Learning at Netflix Scale
Machine Learning at Netflix ScaleAish Fenton
 
Deeper Things: How Netflix Leverages Deep Learning in Recommendations and Se...
 Deeper Things: How Netflix Leverages Deep Learning in Recommendations and Se... Deeper Things: How Netflix Leverages Deep Learning in Recommendations and Se...
Deeper Things: How Netflix Leverages Deep Learning in Recommendations and Se...Sudeep Das, Ph.D.
 
Recommendation Modeling with Impression Data at Netflix
Recommendation Modeling with Impression Data at NetflixRecommendation Modeling with Impression Data at Netflix
Recommendation Modeling with Impression Data at NetflixJiangwei Pan
 
Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality ReductionKnoldus Inc.
 
Distributed Trace & Log Analysis using ML
Distributed Trace & Log Analysis using MLDistributed Trace & Log Analysis using ML
Distributed Trace & Log Analysis using MLJorge Cardoso
 
INTRODUCTION TO NLP, RNN, LSTM, GRU
INTRODUCTION TO NLP, RNN, LSTM, GRUINTRODUCTION TO NLP, RNN, LSTM, GRU
INTRODUCTION TO NLP, RNN, LSTM, GRUSri Geetha
 

What's hot (20)

Meetup: Streaming Data Pipeline Development
Meetup:  Streaming Data Pipeline DevelopmentMeetup:  Streaming Data Pipeline Development
Meetup: Streaming Data Pipeline Development
 
Past, Present & Future of Recommender Systems: An Industry Perspective
Past, Present & Future of Recommender Systems: An Industry PerspectivePast, Present & Future of Recommender Systems: An Industry Perspective
Past, Present & Future of Recommender Systems: An Industry Perspective
 
Lecture1 introduction to big data
Lecture1 introduction to big dataLecture1 introduction to big data
Lecture1 introduction to big data
 
Support Vector Machines
Support Vector MachinesSupport Vector Machines
Support Vector Machines
 
Machine Learning + Graph Databases for Better Recommendations V1 08/06/2022
Machine Learning + Graph Databases for Better Recommendations V1 08/06/2022Machine Learning + Graph Databases for Better Recommendations V1 08/06/2022
Machine Learning + Graph Databases for Better Recommendations V1 08/06/2022
 
Introduction to Apache Spark
Introduction to Apache SparkIntroduction to Apache Spark
Introduction to Apache Spark
 
Improving Machine Learning using Graph Algorithms
Improving Machine Learning using Graph AlgorithmsImproving Machine Learning using Graph Algorithms
Improving Machine Learning using Graph Algorithms
 
Deep Learning for Graphs
Deep Learning for GraphsDeep Learning for Graphs
Deep Learning for Graphs
 
Recommendation at Netflix Scale
Recommendation at Netflix ScaleRecommendation at Netflix Scale
Recommendation at Netflix Scale
 
Time series forecasting with machine learning
Time series forecasting with machine learningTime series forecasting with machine learning
Time series forecasting with machine learning
 
Understanding your Data - Data Analytics Lifecycle and Machine Learning
Understanding your Data - Data Analytics Lifecycle and Machine LearningUnderstanding your Data - Data Analytics Lifecycle and Machine Learning
Understanding your Data - Data Analytics Lifecycle and Machine Learning
 
Calibrated Recommendations
Calibrated RecommendationsCalibrated Recommendations
Calibrated Recommendations
 
GTC 2021: Counterfactual Learning to Rank in E-commerce
GTC 2021: Counterfactual Learning to Rank in E-commerceGTC 2021: Counterfactual Learning to Rank in E-commerce
GTC 2021: Counterfactual Learning to Rank in E-commerce
 
Machine Learning at Netflix Scale
Machine Learning at Netflix ScaleMachine Learning at Netflix Scale
Machine Learning at Netflix Scale
 
Deeper Things: How Netflix Leverages Deep Learning in Recommendations and Se...
 Deeper Things: How Netflix Leverages Deep Learning in Recommendations and Se... Deeper Things: How Netflix Leverages Deep Learning in Recommendations and Se...
Deeper Things: How Netflix Leverages Deep Learning in Recommendations and Se...
 
Recommendation Modeling with Impression Data at Netflix
Recommendation Modeling with Impression Data at NetflixRecommendation Modeling with Impression Data at Netflix
Recommendation Modeling with Impression Data at Netflix
 
Dimensionality Reduction
Dimensionality ReductionDimensionality Reduction
Dimensionality Reduction
 
Distributed Trace & Log Analysis using ML
Distributed Trace & Log Analysis using MLDistributed Trace & Log Analysis using ML
Distributed Trace & Log Analysis using ML
 
INTRODUCTION TO NLP, RNN, LSTM, GRU
INTRODUCTION TO NLP, RNN, LSTM, GRUINTRODUCTION TO NLP, RNN, LSTM, GRU
INTRODUCTION TO NLP, RNN, LSTM, GRU
 
What is MLOps
What is MLOpsWhat is MLOps
What is MLOps
 

Similar to CTR Prediction using Spark Machine Learning Pipelines

Yelp Ad Targeting at Scale with Apache Spark with Inaz Alaei-Novin and Joe Ma...
Yelp Ad Targeting at Scale with Apache Spark with Inaz Alaei-Novin and Joe Ma...Yelp Ad Targeting at Scale with Apache Spark with Inaz Alaei-Novin and Joe Ma...
Yelp Ad Targeting at Scale with Apache Spark with Inaz Alaei-Novin and Joe Ma...Databricks
 
Hybrid cloud-cloud-services-white-paper-external-apw12358usen-20180516
Hybrid cloud-cloud-services-white-paper-external-apw12358usen-20180516Hybrid cloud-cloud-services-white-paper-external-apw12358usen-20180516
Hybrid cloud-cloud-services-white-paper-external-apw12358usen-20180516Tanjina Prema
 
API and Platform Strategies to Win in Global and Local Markets
API and Platform Strategies to Win in Global and Local MarketsAPI and Platform Strategies to Win in Global and Local Markets
API and Platform Strategies to Win in Global and Local MarketsAxway
 
Iag api management architect presentation
Iag   api management architect presentationIag   api management architect presentation
Iag api management architect presentationsflynn073
 
Building SaaS on AWS Serverless
Building SaaS on AWS ServerlessBuilding SaaS on AWS Serverless
Building SaaS on AWS ServerlessHolger Reinhardt
 
Building a SaaS on AWS Serverless
Building a SaaS on AWS ServerlessBuilding a SaaS on AWS Serverless
Building a SaaS on AWS ServerlessAdello
 
20141209 meetup hassan
20141209 meetup hassan20141209 meetup hassan
20141209 meetup hassanNanda Kishore
 
Using Machine Learning to Govern Kafka Clients with Shu Wang & UmaMahesh Sistu
Using Machine Learning to Govern Kafka Clients with Shu Wang & UmaMahesh SistuUsing Machine Learning to Govern Kafka Clients with Shu Wang & UmaMahesh Sistu
Using Machine Learning to Govern Kafka Clients with Shu Wang & UmaMahesh SistuHostedbyConfluent
 
Era of APIs: Why do we need an API strategy?
Era of APIs: Why do we need an API strategy?Era of APIs: Why do we need an API strategy?
Era of APIs: Why do we need an API strategy?Bala Iyer
 
IBM APM for Hybrid Applications
IBM APM for Hybrid ApplicationsIBM APM for Hybrid Applications
IBM APM for Hybrid ApplicationsMatthew Cheah
 
API Economy: 2016 Horizonwatch Trend Brief
API Economy:  2016 Horizonwatch Trend BriefAPI Economy:  2016 Horizonwatch Trend Brief
API Economy: 2016 Horizonwatch Trend BriefBill Chamberlin
 
CompeteWorkshop_ALLSLIDESv2
CompeteWorkshop_ALLSLIDESv2CompeteWorkshop_ALLSLIDESv2
CompeteWorkshop_ALLSLIDESv2Adam Hecht
 
Predicting Banking Customer Needs with an Agile Approach to Analytics in the ...
Predicting Banking Customer Needs with an Agile Approach to Analytics in the ...Predicting Banking Customer Needs with an Agile Approach to Analytics in the ...
Predicting Banking Customer Needs with an Agile Approach to Analytics in the ...Databricks
 
Damien Lefortier, Senior Machine Learning Engineer and Tech Lead in the Predi...
Damien Lefortier, Senior Machine Learning Engineer and Tech Lead in the Predi...Damien Lefortier, Senior Machine Learning Engineer and Tech Lead in the Predi...
Damien Lefortier, Senior Machine Learning Engineer and Tech Lead in the Predi...MLconf
 
How to Execute a Successful API Strategy
How to Execute a Successful API StrategyHow to Execute a Successful API Strategy
How to Execute a Successful API StrategyMatt McLarty
 
UK Integration WebSphere User Group - MultiSpeed IT
UK Integration WebSphere User Group - MultiSpeed ITUK Integration WebSphere User Group - MultiSpeed IT
UK Integration WebSphere User Group - MultiSpeed ITAndyHumphreys
 
Manage your ap is securely and easily ibm apim 4.0
Manage your ap is securely and easily ibm apim 4.0Manage your ap is securely and easily ibm apim 4.0
Manage your ap is securely and easily ibm apim 4.0sflynn073
 
API Management
API ManagementAPI Management
API ManagementProlifics
 

Similar to CTR Prediction using Spark Machine Learning Pipelines (20)

Yelp Ad Targeting at Scale with Apache Spark with Inaz Alaei-Novin and Joe Ma...
Yelp Ad Targeting at Scale with Apache Spark with Inaz Alaei-Novin and Joe Ma...Yelp Ad Targeting at Scale with Apache Spark with Inaz Alaei-Novin and Joe Ma...
Yelp Ad Targeting at Scale with Apache Spark with Inaz Alaei-Novin and Joe Ma...
 
Hybrid cloud-cloud-services-white-paper-external-apw12358usen-20180516
Hybrid cloud-cloud-services-white-paper-external-apw12358usen-20180516Hybrid cloud-cloud-services-white-paper-external-apw12358usen-20180516
Hybrid cloud-cloud-services-white-paper-external-apw12358usen-20180516
 
API and Platform Strategies to Win in Global and Local Markets
API and Platform Strategies to Win in Global and Local MarketsAPI and Platform Strategies to Win in Global and Local Markets
API and Platform Strategies to Win in Global and Local Markets
 
Iag api management architect presentation
Iag   api management architect presentationIag   api management architect presentation
Iag api management architect presentation
 
Building SaaS on AWS Serverless
Building SaaS on AWS ServerlessBuilding SaaS on AWS Serverless
Building SaaS on AWS Serverless
 
Building a SaaS on AWS Serverless
Building a SaaS on AWS ServerlessBuilding a SaaS on AWS Serverless
Building a SaaS on AWS Serverless
 
20141209 meetup hassan
20141209 meetup hassan20141209 meetup hassan
20141209 meetup hassan
 
Using Machine Learning to Govern Kafka Clients with Shu Wang & UmaMahesh Sistu
Using Machine Learning to Govern Kafka Clients with Shu Wang & UmaMahesh SistuUsing Machine Learning to Govern Kafka Clients with Shu Wang & UmaMahesh Sistu
Using Machine Learning to Govern Kafka Clients with Shu Wang & UmaMahesh Sistu
 
Era of APIs: Why do we need an API strategy?
Era of APIs: Why do we need an API strategy?Era of APIs: Why do we need an API strategy?
Era of APIs: Why do we need an API strategy?
 
AdvertOne
AdvertOneAdvertOne
AdvertOne
 
Ad exchange product description
Ad exchange product descriptionAd exchange product description
Ad exchange product description
 
IBM APM for Hybrid Applications
IBM APM for Hybrid ApplicationsIBM APM for Hybrid Applications
IBM APM for Hybrid Applications
 
API Economy: 2016 Horizonwatch Trend Brief
API Economy:  2016 Horizonwatch Trend BriefAPI Economy:  2016 Horizonwatch Trend Brief
API Economy: 2016 Horizonwatch Trend Brief
 
CompeteWorkshop_ALLSLIDESv2
CompeteWorkshop_ALLSLIDESv2CompeteWorkshop_ALLSLIDESv2
CompeteWorkshop_ALLSLIDESv2
 
Predicting Banking Customer Needs with an Agile Approach to Analytics in the ...
Predicting Banking Customer Needs with an Agile Approach to Analytics in the ...Predicting Banking Customer Needs with an Agile Approach to Analytics in the ...
Predicting Banking Customer Needs with an Agile Approach to Analytics in the ...
 
Damien Lefortier, Senior Machine Learning Engineer and Tech Lead in the Predi...
Damien Lefortier, Senior Machine Learning Engineer and Tech Lead in the Predi...Damien Lefortier, Senior Machine Learning Engineer and Tech Lead in the Predi...
Damien Lefortier, Senior Machine Learning Engineer and Tech Lead in the Predi...
 
How to Execute a Successful API Strategy
How to Execute a Successful API StrategyHow to Execute a Successful API Strategy
How to Execute a Successful API Strategy
 
UK Integration WebSphere User Group - MultiSpeed IT
UK Integration WebSphere User Group - MultiSpeed ITUK Integration WebSphere User Group - MultiSpeed IT
UK Integration WebSphere User Group - MultiSpeed IT
 
Manage your ap is securely and easily ibm apim 4.0
Manage your ap is securely and easily ibm apim 4.0Manage your ap is securely and easily ibm apim 4.0
Manage your ap is securely and easily ibm apim 4.0
 
API Management
API ManagementAPI Management
API Management
 

Recently uploaded

DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNKTimothy Spann
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制vexqp
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...HyderabadDolls
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...nirzagarg
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numberssuginr1
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraGovindSinghDasila
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubaikojalkojal131
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteedamy56318795
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...ZurliaSoop
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themeitharjee
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxronsairoathenadugay
 
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...gragchanchal546
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...Elaine Werffeli
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...gajnagarg
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...SOFTTECHHUB
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...kumargunjan9515
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...HyderabadDolls
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...HyderabadDolls
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareGraham Ware
 

Recently uploaded (20)

DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Purnia [ 7014168258 ] Call Me For Genuine Models We...
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
Aspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - AlmoraAspirational Block Program Block Syaldey District - Almora
Aspirational Block Program Block Syaldey District - Almora
 
Dubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls DubaiDubai Call Girls Peeing O525547819 Call Girls Dubai
Dubai Call Girls Peeing O525547819 Call Girls Dubai
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about them
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
Gulbai Tekra * Cheap Call Girls In Ahmedabad Phone No 8005736733 Elite Escort...
 
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
SAC 25 Final National, Regional & Local Angel Group Investing Insights 2024 0...
 
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
Top profile Call Girls In dimapur [ 7014168258 ] Call Me For Genuine Models W...
 
Abortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get CytotecAbortion pills in Jeddah | +966572737505 | Get Cytotec
Abortion pills in Jeddah | +966572737505 | Get Cytotec
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
Charbagh + Female Escorts Service in Lucknow | Starting ₹,5K To @25k with A/C...
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 

CTR Prediction using Spark Machine Learning Pipelines

  • 1. © 2015 IBM Corporation Predict whether an online ad will be clicked using Spark MLlib pipelines Presenter: Manisha Sule, IBM Spark Enablement
  • 2. © 2015 IBM Corporation2 Outline  Online advertising and click through rate prediction  Data set and features  Spark MLlib and the Pipeline API  MLlib pipeline for Click Through Rate Prediction  Questions
  • 4. © 2015 IBM Corporation4 Online advertising
  • 5. © 2015 IBM Corporation5 Online advertising on search engines
  • 6. © 2015 IBM Corporation6 Social Media advertising  Facebook's "Sponsored Stories"  LinkedIn's "Sponsored Updates"  Twitter's "Promoted Tweets"
  • 7. © 2015 IBM Corporation7 Online advertising is big business
  • 8. © 2015 IBM Corporation8 Types of online advertising  Retargeting – Using cookies, track if a user left a webpage without making a purchase and retarget the user with ads from that site  Behavioral targeting – Data related to user’s online activity is collected from multiple websites, thus creating a detailed picture of the user’s interests to deliver more targeted advertising  Contextual advertising – Display ads related to the content of the webpage  Geo targeting – Ads are presented based on the user’s suspected geography
  • 9. © 2015 IBM Corporation9 Online Advertising – The Players Publishers:  Earn revenue by displaying ads on their sites  Google, Wall Street Journal, Twitter Advertisers:  Pay for their ads to be displayed on publisher sites  Goal is to increase business via advertising Matchmakers:  Match publishers with advertisers  Interaction with advertisers and publishers occurs real time
  • 10. © 2015 IBM Corporation10 Revenue models from online advertising  CPM (Cost-Per-Mille): is an inventory based pricing model. Is when the price is based on 1,000 impressions.  CPC (Cost-Per-Click): Is a performance-based metric. This means the Publisher only gets paid when (and if) a user clicks on an ad  CPA (Cost Per Action): Best deal of all for Advertisers in terms of risk because they only pay for media when it results in a sale
  • 11. © 2015 IBM Corporation11 Click-through rate (CTR)  Ratio of users who click on an ad to the number of total users who view the ad  Used to measure the success of an online advertising campaign  Very first online ad shown for AT&T on website HotWired has a 44% Click-through rate  Today, typical click through rate is less than 1%  In a pay-per-click setting, revenue can be maximized by choosing to display ads that have the maximum CTR, hence the problem of CTR Prediction.
  • 12. © 2015 IBM Corporation12 Predict CTR – a Scalable Machine Learning success story Predict conditional probability that the ad will be clicked by the user given the predictive features of ads Predictive features are:  Ad’s historical performance  Advertiser and ad content info  Publisher info  User Info (eg: search/ click history) Data set is high dimensional, sparse and skewed:  Hundreds of millions of online users  Millions of unique publisher pages to display ads  Millions of unique ads to display  Very few ads get clicked by users
  • 14. © 2015 IBM Corporation14 About the dataset used  Sample of the dataset used for the Display Advertising Challenge hosted by Kaggle: https://www.kaggle.com/c/criteo-display-ad-challenge/  Consists of a portion of Criteo's traffic over a period of 7 days.
  • 15. © 2015 IBM Corporation15 Dataset Features Data fields  Label - Target variable that indicates if an ad was clicked (1) or not (0).  I1-I13 - A total of 13 columns of integer features (mostly count features).  C1-C26 - A total of 26 columns of categorical features. The values of these features have been hashed onto 32 bits for anonymization purposes.
  • 16. Spark MLlib and Pipeline API
  • 17. © 2015 IBM Corporation17 Spark MLlib  Spark’s machine learning (ML) library.  Goal is to make ML scalable and easy  Includes common learning algorithms and utilities, like classification, regression, clustering, collaborative filtering, dimensionality reduction  Includes lower-level optimization primitives and higher level pipeline APIs  Divided into two packages • Spark.mllib: original API built on top of RDDs • Spark.ml: provides higher level API built on top of DataFrames for constructing ML pipelines
  • 18. © 2015 IBM Corporation18 ML pipelines for complex ML workflows  DataFrame: Equivalent to a relational table in Spark SQL  Transformer: An algorithm that transforms one DataFrame into another  Estimator: An algorithm that can be fit on a DataFrame to produce a Transformer  Pipeline: Chains multiple transformers and estimators to produce a ML workflow  Parameter: Common API for specifying parameters
  • 19. MLlib Pipeline for Click Through Rate Prediction
  • 20. © 2015 IBM Corporation20 Overview of ML pipeline for Click Through Rate Prediction
  • 21. © 2015 IBM Corporation21 Step 1 - Parse Data into Spark SQL DataFrames  Spark SQL converts a RDD of Row objects into a DataFrame  Rows are constructed by passing key/value pairs where keys define the column names of the table  The data types are inferred by looking at the data
  • 22. © 2015 IBM Corporation22 Step 2 – Feature Transformer using StringIndexer  Encodes a string column to a column of indices  Indices are from 0 to max number of distinct string values, ordered by frequencies  If column is numeric, cast it to string and index string values
  • 23. © 2015 IBM Corporation23 Step 3 - Feature Transformer using One Hot Encoding  Maps a column of label indices to a column of binary vectors  Allows algorithms which expect continuous features, to use categorical features
  • 24. © 2015 IBM Corporation24 Step 4 – Feature Selector using Vector Assembler  Transformer that combines a given list of columns into a single vector column  Useful for combining raw features and features generated by different feature transformers into a single feature vector
  • 25. © 2015 IBM Corporation25 Step 5 – Train a model using Estimator Logistic Regression
  • 26. © 2015 IBM Corporation26 Apply the pipeline to make predictions
  • 27. Demo
  • 28. © 2015 IBM Corporation28 Parameter Tuning using CrossValidator or TrainValidationSplit
  • 29. © 2015 IBM Corporation29 Some alternatives:  Use Hashed features instead of OHE  Use Log loss evaluation or ROC to evaluate Logistic Regression  Perform feature selection  Use Naïve Bayes or other binary classification algorithms
  • 30. © 2015 IBM Corporation30 Thank you!
  • 31. © 2015 IBM Corporation31 References:  Wikipedia: https://en.wikipedia.org/wiki/Online_advertising  BerkeleyX: CS190.1x Scalable Machine Learning (https://courses.edx.org/)  Big Data University: Spark Fundamentals 1 (https://bigdatauniversity.com/courses/spark- fundamentals/)  Big Data University: Spark Fundamentals 2 (http://bigdatauniversity.com/courses/spark- fundamentals-ii/)  Learning Spark: Lightning-Fast Big Data Analysis (Book by Andy Konwinski, Holden Karau, and Patrick Wendell)  Advanced Analytics with Spark: Patterns for Learning from Data at Scale (Book by Josh Wills, Sandy Ryza, Sean Owen, and Uri Laserson)  Spark Programming Guide (http://spark.apache.org/docs/latest/programming-guide.html)

Editor's Notes

  1. 1