Thomas Jensen. Machine Learning

The Impact of Big Data on
Classic Machine Learning
Algorithms
Thomas Jensen, Senior Business Analyst @ Expedia

Who am I?
• Senior Business Analyst @ Expedia
• Working within the competitive
intelligence unit
• Responsible for :
• Algorithm that scores new hotels
• Algorithm that predicts room nights
sold on existing Expedia hotels
• Scraping competitor sites
• Other stuff….

The Promise of Big Data
Real time data
Data-driven decisions
More accurate and
robust models
Granularity

Big Data Challenges
Data Processing – not
going to talk about
this.
Speed at which to use
data – how fast should
we update
algorithms?
How do we train
algorithms on data
sets that do not fit
into memory?

Big Data Challenges
Taken from: http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram

Classification - Logistic Regression
• One classic task in machine learning / statistics is to classify some
objects/events/decisions correctly
• Examples are:
• Customer churn
• Click behavior
• Purchase behavior
• ….
• One of the most popular algorithms to carry out these tasks is logistic
regression

What is logistic regression?
• Logistic regression attaches probabilities to individual outcomes,
showing how likely they are to belong to one class or the other
• $\Pr(y = 1 \mid x) = \dfrac{1}{1 + e^{-x\beta}}$
• The challenge is to choose the
optimal beta(s)
• To do that we minimize a cost
function
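A minimal sketch of the two pieces named above, in plain Python. It assumes the cost function is the usual log loss (the slides do not name it); the helper names `predict_proba` and `log_loss` are illustrative only.

```python
import math

def predict_proba(x, beta):
    """Pr(y = 1 | x) = 1 / (1 + exp(-x . beta)) for one feature vector x."""
    z = sum(xi * bi for xi, bi in zip(x, beta))
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(rows, labels, beta):
    """Average log loss over a data set (assumed cost function)."""
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for x, y in zip(rows, labels):
        p = min(max(predict_proba(x, beta), eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(rows)

rows = [[1.0, 2.0], [1.0, -1.0]]   # first column acts as an intercept
labels = [1, 0]
print(predict_proba(rows[0], [0.0, 0.0]))   # an all-zero beta gives 0.5
print(log_loss(rows, labels, [0.0, 0.0]))   # equals log(2) for an all-zero beta
print(log_loss(rows, labels, [0.0, 1.0]))   # a better beta gives a lower cost
```

Choosing the optimal beta then means searching for the vector that makes `log_loss` as small as possible.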

Why Use Logistic Regression?
• It is a simple and well-understood algorithm
• Outputs probabilities
• There are tried and tested methods to estimate the parameters
• It is flexible – can handle a number of different inputs, and feature
transformations

Usual Approaches
• Batch training (offline approach)
• Get all the data and train the algorithm in one go
• Disadvantages when data is big
• Requires all data to be loaded into memory
• Periodic retraining is necessary
• Very time consuming with big data!
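The batch approach can be sketched as follows. This is a minimal illustration using full-batch gradient descent on the log loss, which is one assumed way to fit the model; the slides do not say which batch solver is used, and `batch_train` is a hypothetical name. The point is that every single parameter update needs a pass over all rows.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def batch_train(rows, labels, alpha=0.5, n_iters=200):
    """Full-batch gradient descent: each update reads every row."""
    n_features = len(rows[0])
    beta = [0.0] * n_features
    for _ in range(n_iters):
        grad = [0.0] * n_features
        for x, y in zip(rows, labels):            # full pass over the data
            error = sigmoid(sum(xi * bi for xi, bi in zip(x, beta))) - y
            for j in range(n_features):
                grad[j] += error * x[j]
        # one update per pass, using the average gradient
        beta = [b - alpha * g / len(rows) for b, g in zip(beta, grad)]
    return beta

# Tiny toy data set; the first column acts as an intercept.
rows = [[1.0, 0.5], [1.0, 1.5], [1.0, -0.5], [1.0, -1.5]]
labels = [1, 1, 0, 0]
beta = batch_train(rows, labels)
print(beta)
```

With `n_iters` passes over a data set that does not fit in memory, each pass becomes a full read from disk, which is where the training time goes.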

Examples of Logistic Regression in Industry
Settings – Real Time Bidding
• RTB
• RTB algorithms are usually
based on logistic regression
• Whether or not to bid on a
user is determined by the
probability that the user will
click on an ad
• Each day billions of bids are
processed
• Each bid has to be processed
within 80 milliseconds

Examples of Logistic Regression in Industry
Settings – Fraud Detection
Detecting Fraudulent Credit Card
Transactions
• The probability that a transaction
is using a stolen credit card is
typically estimated with logistic
regression
• Billions of transactions are
analyzed each day

How Slow is the Batch Version of Logistic
Regression?
One target variable and two feature vectors.
All randomly generated.

A Real World Problem
• Some stats on the training job in the pipeline:
• Runs training jobs on a per country basis
• Longest running job lasts ~9 hours
• Shortest running job lasts ~3 hours
• There are often convergence failures
• What we need is an algorithm that:
• Can reduce training time
• Is robust towards convergence failures

A Big Data Friendly Approach
Online Training
• Pass each data point sequentially through the algorithm
• Only requires one data point at a time in memory
• Allows for on-the-fly training of the algorithm

Online Learning
• We want to learn a vector of
weights
• Initialize all weights. Begin loop:
1. Get training example
2. Make a prediction for the target
variable
3. Learn the true value of the
target
4. Update the weights and go to 1
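The four steps of the loop can be sketched in plain Python. This is an illustration under assumptions: the model is logistic regression, the update is a gradient step on the log loss, and `online_train` and `example_stream` are hypothetical names. The stream is a generator, so only one example is held in memory at a time.

```python
import math
import random

def predict(weights, x):
    z = sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def online_train(stream, n_features, alpha=0.1):
    weights = [0.0] * n_features              # initialize all weights
    for x, y in stream:                       # 1. get training example
        p = predict(weights, x)               # 2. make a prediction
        error = p - y                         # 3. learn the true value
        weights = [w - alpha * error * xi     # 4. update, then go to 1
                   for w, xi in zip(weights, x)]
    return weights

def example_stream(n, seed=0):
    """Synthetic stream: label is 1 when the single feature is positive."""
    rng = random.Random(seed)
    for _ in range(n):
        x1 = rng.uniform(-2, 2)
        yield [1.0, x1], (1 if x1 > 0 else 0)

weights = online_train(example_stream(2000), n_features=2)
print(weights)
```

Because the weights are usable after every update, the same loop can keep running on a live stream.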

Online Learning
• Initialise all weights. Begin loop:
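Repeat {
  For i = 1 to m {
    $\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} \mathrm{cost}(\theta, (x_i, y_i))$
  }
}
• $\frac{\partial}{\partial \theta_j} \mathrm{cost}$: the partial derivative of the cost function
• $\mathrm{cost}(\theta, (x_i, y_i))$: the cost function given theta and row i, i.e. how wrong are we?
• $\alpha$: the step size, i.e. how fast we should descend the gradient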
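For logistic regression with the log loss (an assumption; the slides do not state which cost function is used), the partial derivative in the update has a simple closed form. Writing $h_\theta(x) = 1 / (1 + e^{-x\theta})$ and $x_{ij}$ for feature j of row i:

```latex
\begin{aligned}
\mathrm{cost}(\theta, (x_i, y_i)) &= -y_i \log h_\theta(x_i) - (1 - y_i) \log\bigl(1 - h_\theta(x_i)\bigr) \\
\frac{\partial}{\partial \theta_j}\, \mathrm{cost}(\theta, (x_i, y_i)) &= \bigl(h_\theta(x_i) - y_i\bigr)\, x_{ij} \\
\theta_j &:= \theta_j - \alpha\, \bigl(h_\theta(x_i) - y_i\bigr)\, x_{ij}
\end{aligned}
```

Each update therefore costs only one prediction and one pass over the features of a single row.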

Online Learning
• Approaches the minimum of the cost function in a jumpy manner and
never actually settles on the minimum.

Batch vs. Online Learning
Data
Size: 4.8GB
Rows: 500,000
Columns: 5000
[Bar chart: training time for Batch, SGDClassifier and Sofia-ml; axis from 0 to 120, units not given]
*Times include reading data and training algorithm

Online Learning Vs. Batch
Online Learning
• When we have a continuous
stream of data
• When it is important to update
the algorithm in real time – can
hit a moving target
• When training speed is
important
• Parameters are “jumpy” around
the optimal values
Batch
• When it is very important to get
the exact optimal values
• When data can fit in memory
• When training time is not of the
essence

Popular Online Learning Libraries
• Sofia-ml (C/C++)
• Requires data in svmLight format
• Has implementations of SVM, neural networks and logistic regression
• Supports classification and ranking
• Vowpal Wabbit (C/C++)
• Requires data in its own VW format
• Has implementations of the most popular loss functions
• Supports classification, ranking and regression
• Pandas + scikit-learn (python)
• Pandas has a nice function for reading files in batches
• Can handle sparse and non-sparse matrices
• scikit-learn has an SGD classifier that can fit the model in batches
• Supports classification, ranking and regression
