The Impact of Big Data on
Classic Machine Learning
Algorithms
Thomas Jensen, Senior Business Analyst @ Expedia
Who am I?
• Senior Business Analyst @ Expedia
• Working within the competitive
intelligence unit
• Responsible for :
• Algorithm that scores new hotels
• Algorithm that predicts room nights
sold on existing Expedia hotels
• Scraping competitor sites
• Other stuff….
The Promise of Big Data
Real time data
Data-driven decisions
More accurate and
robust models
Granularity
Big Data Challenges
Data Processing – not
going to talk about
this.
Speed at which to use
data – how fast should
we update
algorithms?
How do we train
algorithms on data
sets that do not fit
into memory?
Big Data Challenges
Taken from: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Classification - Logistic Regression
• One classic task in machine learning / statistics is to classify some
objects/events/decisions correctly
• Examples are:
• Customer churn
• Click behavior
• Purchase behavior
• ….
• One of the most popular algorithms to carry out these tasks is logistic
regression
What is logistic regression?
• Logistic regression attaches probabilities to individual outcomes,
showing how likely they are to belong to one class or the other
• $\Pr(y \mid x) = \frac{1}{1 + e^{-x\beta}}$
• The challenge is to choose the
optimal beta(s)
• To do that we minimize a cost
function
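As a rough illustration of the formula and cost function above, here is a minimal sketch in plain Python. The function names (`sigmoid`, `predict_proba`, `log_loss`) and the toy data are illustrative only. Using log loss (the negative log-likelihood) as the cost is an assumption, since the slides do not name the cost function.

```python
import math

def sigmoid(z):
    """Logistic function 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, beta):
    """Pr(y = 1 | x) for one row x and coefficient vector beta."""
    z = sum(xj * bj for xj, bj in zip(x, beta))
    return sigmoid(z)

def log_loss(rows, labels, beta):
    """Average negative log-likelihood (assumed cost function)."""
    total = 0.0
    for x, y in zip(rows, labels):
        p = predict_proba(x, beta)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(rows)

# Toy data: the first column is a constant that acts as an intercept.
rows = [[1.0, 0.5], [1.0, -1.5], [1.0, 2.0]]
labels = [1, 0, 1]

loss_zero = log_loss(rows, labels, [0.0, 0.0])    # every prediction is 0.5
loss_better = log_loss(rows, labels, [0.0, 1.0])  # follows the labels more closely
print(loss_zero, loss_better)
```

A beta that separates the labels better gives a lower cost, which is what "choosing the optimal beta(s)" amounts to.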
Why Use Logistic Regression?
• It is a simple and well-understood algorithm
• Outputs probabilities
• There are tried and tested methods to estimate the parameters
• It is flexible – can handle a number of different inputs, and feature
transformations
Usual Approaches
• Batch training (offline approach)
• Get all the data and train the algorithm in one go
• Disadvantages when data is big
• Requires all data to be loaded into memory
• Periodic retraining is necessary
• Very time consuming with big data!
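To make the batch approach concrete, the following is a minimal sketch of batch gradient descent for logistic regression in plain Python. It is not the speaker's implementation: the function name `batch_gradient_descent`, the step size, the iteration count and the randomly generated data are all assumptions. The point to notice is that every single iteration scans the complete data set, which must therefore be held in memory.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def batch_gradient_descent(rows, labels, alpha=0.5, n_iter=200):
    """Fit logistic regression; each iteration uses ALL rows."""
    beta = [0.0] * len(rows[0])
    m = len(rows)
    for _ in range(n_iter):
        grad = [0.0] * len(beta)
        for x, y in zip(rows, labels):            # full pass over the data
            p = sigmoid(sum(xj * bj for xj, bj in zip(x, beta)))
            for j, xj in enumerate(x):
                grad[j] += (p - y) * xj / m       # average log-loss gradient
        beta = [bj - alpha * gj for bj, gj in zip(beta, grad)]
    return beta

# Hypothetical data: two random features, labels drawn from a known model.
random.seed(0)
rows, labels = [], []
for _ in range(500):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    p_true = sigmoid(1.0 * x1 - 2.0 * x2)
    rows.append([x1, x2])
    labels.append(1 if random.random() < p_true else 0)

beta = batch_gradient_descent(rows, labels)
print(beta)
```

The cost of one iteration grows linearly with the number of rows, so with hundreds of iterations the training time quickly becomes a problem on large data sets.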
Batch Training
Examples of Logistic Regression in Industry
Settings – Real Time Bidding
• RTB
• RTB algorithms are usually
based on logistic regression
• Whether or not to bid on a
user is determined by the
probability that the user will
click on an ad
• Each day billions of bids are
processed
• Each bid has to be processed
within 80 milliseconds
Examples of Logistic Regression in Industry
Settings – Fraud Detection
Detecting Fraudulent Credit Card
Transactions
• The probability that a transaction
is using a stolen credit card is
typically estimated with logistic
regression
• Billions of transactions are
analyzed each day
How Slow is the Batch Version of Logistic
Regression?
One target variable and two feature vectors.
All randomly generated.
A Real World Problem
A Real World Problem
• Some stats on the training job in the pipeline:
• Runs training jobs on a per country basis
• Longest running job lasts ~9 hours
• Shortest running job lasts ~3 hours
• There are often convergence failures
• What we need is an algorithm that:
• Can reduce training time
• Is robust towards convergence failures
A Big Data Friendly Approach
Online Training
• Pass each data point sequentially through the algorithm
• Only requires one data point at a time in memory
• Allows for on-the-fly training of the algorithm
Online Learning
• We want to learn a vector of
weights
• Initialize all weights. Begin loop:
1. Get training example
2. Make a prediction for the target
variable
3. Learn the true value of the
target
4. Update the weights and go to 1
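The four steps above can be sketched as a small loop in plain Python. This is an illustration under stated assumptions, not the speaker's code: `online_logistic` and `example_stream` are hypothetical names, the data source is synthetic, and the update assumes a logistic regression model with log loss.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def online_logistic(stream, n_features, alpha=0.1):
    """Consume (x, y) pairs one at a time; return weights and average loss."""
    weights = [0.0] * n_features                  # initialize all weights
    total_loss, n_seen = 0.0, 0
    for x, y in stream:                           # 1. get training example
        z = sum(w * xj for w, xj in zip(weights, x))
        p = sigmoid(z)                            # 2. make a prediction
        # 3. the true value y is revealed; record how wrong we were
        total_loss -= math.log(p if y == 1 else 1.0 - p)
        n_seen += 1
        # 4. update the weights, then continue with the next example
        weights = [w - alpha * (p - y) * xj for w, xj in zip(weights, x)]
    return weights, total_loss / max(n_seen, 1)

def example_stream(n):
    """Hypothetical data source: label is 1 when the feature is positive."""
    rng = random.Random(0)
    for _ in range(n):
        x1 = rng.uniform(-1, 1)
        yield [1.0, x1], (1 if x1 > 0 else 0)

weights, avg_loss = online_logistic(example_stream(5000), n_features=2)
print(weights, avg_loss)
```

Because `example_stream` is a generator, only one example is in memory at any time, and the loss is measured on each example before the model has learned from it.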
Online Learning
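• Initialize all weights. Begin loop:
Repeat {
  For i = 1 to m {
    $\theta_j = \theta_j - \alpha \frac{\partial}{\partial \theta_j} \mathrm{cost}(\theta, (x_i, y_i))$
  }
}
• $\frac{\partial}{\partial \theta_j}$: the partial derivative of the cost function
• $\mathrm{cost}(\theta, (x_i, y_i))$: the cost function, given theta and row i, i.e. how wrong are we?
• $\alpha$: the step size, i.e. how fast we should descend the gradient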
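For a single row, the update rule above can be written out explicitly. The sketch below assumes the cost is the log loss of a logistic regression model (the slides do not spell this out), in which case the partial derivative with respect to theta_j is (p - y) * x_j. The names `cost`, `gradient` and `sgd_step` are illustrative, and the finite-difference loop is only there to check the analytic gradient.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def cost(theta, x, y):
    """Log loss for one row (x, y); assumed form of cost(theta, (x_i, y_i))."""
    p = sigmoid(sum(t * xj for t, xj in zip(theta, x)))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def gradient(theta, x, y):
    """Partial derivatives of the log loss: (p - y) * x_j for each j."""
    p = sigmoid(sum(t * xj for t, xj in zip(theta, x)))
    return [(p - y) * xj for xj in x]

def sgd_step(theta, x, y, alpha):
    """theta_j = theta_j - alpha * d cost / d theta_j, for every j."""
    return [t - alpha * g for t, g in zip(theta, gradient(theta, x, y))]

theta, x, y = [0.2, -0.4], [1.0, 3.0], 1

# Finite-difference approximation of each partial derivative.
eps = 1e-6
numeric = []
for j in range(len(theta)):
    bumped = list(theta)
    bumped[j] += eps
    numeric.append((cost(bumped, x, y) - cost(theta, x, y)) / eps)

analytic = gradient(theta, x, y)
new_theta = sgd_step(theta, x, y, alpha=0.1)
print(analytic, numeric, new_theta)
```

One step with a small step size lowers the cost on the row it was computed from, but it may raise the cost on other rows, which is why the path of the parameters is noisy.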
Online Learning
• Approaches the minimum of the cost function in a jumpy manner and
never actually settles on the minimum.
Batch vs. Online Learning
Data
Size: 4.8GB
Rows: 500,000
Columns: 5000
[Bar chart: training time for Batch, SGDClassifier and Sofia-ml; axis from 0 to 120]
*Times include reading data and training algorithm
Online Learning Vs. Batch
Online Learning
• When we have a continuous
stream of data
• When it is important to update
the algorithm in real time – can
hit a moving target
• When training speed is
important
• Parameters are “jumpy” around
the optimal values
Batch
• When it is very important to get
the exact optimal values
• When data can fit in memory
• When training time is not of the
essence
Popular Online Learning Libraries
• Sofia-ml (C/C++)
• Requires data in SVMlight format
• Has implementations of SVM, neural networks and logistic regression
• Supports classification and ranking
• Vowpal Wabbit (C/C++)
• Requires data in its own VW format
• Has implementations of the most popular loss functions
• Supports classification, ranking and regression
• Pandas + scikit-learn (Python)
• Pandas has a nice function for reading files in batches
• Can handle sparse and non-sparse matrices
• scikit-learn has an SGD classifier that can fit the model in batches
• Supports classification, ranking and regression
Thomas Jensen. Machine Learning
