Machine Learning Specialization
Scaling to Huge Datasets &
Online Learning
Emily Fox & Carlos Guestrin
Machine Learning Specialization
University of Washington
©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization2
Data
Why gradient ascent is slow…
©2015-2016 Emily Fox & Carlos Guestrin
w(t) → compute gradient ∇ℓ(w(t)) (pass over data) → update coefficients → w(t+1)
     → compute gradient ∇ℓ(w(t+1)) (pass over data) → update coefficients → w(t+2)
Every update requires
a full pass over data
Machine Learning Specialization3
Data sets are getting huge, and we need them!
©2015-2016 Emily Fox & Carlos Guestrin
5B views/day
500M Tweets/day
WWW: 4.8B webpages
300 hours
uploaded/min
1 B users,
ad revenue
Need ML algorithm to learn from
billions of video views every day, &
to recommend ads within milliseconds
Internet of
Things
Sensors
everywhere
Machine Learning Specialization5
ML improves (significantly)
with bigger datasets
©2015-2016 Emily Fox & Carlos Guestrin
Data size vs. model complexity over time (1996 → 2006 → 2016):
•  1996, small data: complex models, needed for accuracy (kernels, graphical models)
•  2006, big data: simple models, needed for speed (logistic regression, matrix factorization)
•  2016, bigger data: complex models, needed for accuracy; parallelism, GPUs, computer clusters (boosted trees, tensor factorization, deep learning, massive graphical models); these are today's hot topics
“The Unreasonable Effectiveness of Data” [Halevy, Norvig, Pereira ’09]
Machine Learning Specialization7 ©2015-2016 Emily Fox & Carlos Guestrin
Training Data → Feature extraction h(x) → ML model (coefficients ŵ) → prediction compared against y by the Quality metric → ML algorithm updates ŵ
This module: small
change to algorithm,
which often improves
running time a lot
Machine Learning Specialization8
Data
Stochastic gradient ascent
©2015-2016 Emily Fox & Carlos Guestrin
w(t) → compute gradient using only small subsets of data → update coefficients → w(t+1)
     → update coefficients → w(t+2) → update coefficients → w(t+3) → update coefficients → w(t+4)
Many updates for
each pass over data
Machine Learning Specialization
Learning, one data point at a time
©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization11
Gradient ascent
©2015-2016 Emily Fox & Carlos Guestrin
Algorithm:
while not converged:
    w(t+1) ← w(t) + η ∇ℓ(w(t))
Machine Learning Specialization12
How expensive is gradient ascent?
©2015-2016 Emily Fox & Carlos Guestrin
\[
\frac{\partial \ell(\mathbf{w})}{\partial w_j} = \sum_{i=1}^{N} h_j(\mathbf{x}_i)\Big(\mathbf{1}[y_i = +1] - P(y = +1 \mid \mathbf{x}_i, \mathbf{w})\Big)
\]
Sum over data points. The contribution of data point (x_i, y_i) to the gradient is
\[
\frac{\partial \ell_i(\mathbf{w})}{\partial w_j} = h_j(\mathbf{x}_i)\Big(\mathbf{1}[y_i = +1] - P(y = +1 \mid \mathbf{x}_i, \mathbf{w})\Big)
\]
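The formulas above translate almost directly into code. Below is a minimal NumPy sketch (not the course's own implementation) of the full gradient and of a single point's contribution, assuming the features h_j(x_i) are stored row-wise in an array H (so H[i, j] = h_j(x_i)) and the labels y take values +1/-1:

```python
import numpy as np

def predict_probability(H, w):
    """P(y = +1 | x_i, w) for every row i, using the sigmoid (logistic) link."""
    return 1.0 / (1.0 + np.exp(-H.dot(w)))

def full_gradient(H, y, w):
    """Full gradient: sum over all N points of h_j(x_i) * (1[y_i = +1] - P(y = +1 | x_i, w))."""
    indicator = (y == +1).astype(float)
    errors = indicator - predict_probability(H, w)
    return H.T.dot(errors)              # one full pass over the data

def point_gradient(H, y, w, i):
    """Contribution of the single data point (x_i, y_i) to the gradient."""
    indicator = 1.0 if y[i] == +1 else 0.0
    error = indicator - predict_probability(H[i:i+1], w)[0]
    return H[i] * error                 # vector of per-coefficient contributions
```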
Machine Learning Specialization13
Every step requires touching
every data point!!!
©2015-2016 Emily Fox & Carlos Guestrin
\[
\frac{\partial \ell(\mathbf{w})}{\partial w_j} = \sum_{i=1}^{N} h_j(\mathbf{x}_i)\Big(\mathbf{1}[y_i = +1] - P(y = +1 \mid \mathbf{x}_i, \mathbf{w})\Big)
\]
Sum over data points; each term is ∂ℓ_i(w)/∂w_j.

Time to compute contribution of (x_i, y_i) | # of data points (N) | Total time to compute 1 step of gradient ascent
1 millisecond | 1,000 | 1 second
1 second | 1,000 | 16.7 minutes
1 millisecond | 10 million | 2.8 hours
1 millisecond | 10 billion | 115.7 days
Machine Learning Specialization15
Instead of all data points for gradient,
use 1 data point only???
©2015-2016 Emily Fox & Carlos Guestrin
Gradient ascent:
\[
\frac{\partial \ell(\mathbf{w})}{\partial w_j} = \sum_{i=1}^{N} h_j(\mathbf{x}_i)\Big(\mathbf{1}[y_i = +1] - P(y = +1 \mid \mathbf{x}_i, \mathbf{w})\Big)
\]
(sum over data points)

Stochastic gradient ascent:
\[
\frac{\partial \ell(\mathbf{w})}{\partial w_j} \approx \frac{\partial \ell_i(\mathbf{w})}{\partial w_j} = h_j(\mathbf{x}_i)\Big(\mathbf{1}[y_i = +1] - P(y = +1 \mid \mathbf{x}_i, \mathbf{w})\Big)
\]
Each time, pick a different data point i.
Machine Learning Specialization
Stochastic gradient ascent
©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization18
Stochastic gradient ascent
for logistic regression
©2015-2016 Emily Fox & Carlos Guestrin
init w(1) = 0, t = 1
until converged:
    for i = 1, …, N:        (each time, pick a different data point i)
        for j = 0, …, D:
            partial[j] = h_j(x_i) ( 1[y_i = +1] − P(y = +1 | x_i, w(t)) )
        w_j(t+1) ← w_j(t) + η partial[j]    (for all j)
        t ← t + 1
(The full-gradient version replaces the single-point contribution with a sum over all data points.)
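A minimal sketch of the loop above, reusing point_gradient from the earlier sketch; the step size eta and the fixed number of passes are illustrative assumptions, not values prescribed by the slides:

```python
def stochastic_gradient_ascent(H, y, eta=0.1, n_passes=10):
    """One coefficient update per data point; many updates per pass over the data."""
    N, D = H.shape
    w = np.zeros(D)                          # init w(1) = 0
    for _ in range(n_passes):                # "until converged", approximated by a fixed pass count
        for i in range(N):                   # each time, pick a different data point i
            w = w + eta * point_gradient(H, y, w, i)
    return w
```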
Machine Learning Specialization19
Comparing computational time per step
©2015-2016 Emily Fox & Carlos Guestrin
Gradient ascent: ∂ℓ(w)/∂w_j = Σ_{i=1}^{N} h_j(x_i) ( 1[y_i = +1] − P(y = +1 | x_i, w) )
Stochastic gradient ascent: ∂ℓ(w)/∂w_j ≈ ∂ℓ_i(w)/∂w_j

Time to compute contribution of (x_i, y_i) | # of data points (N) | Total time for 1 step of gradient | Total time for 1 step of stochastic gradient
1 millisecond | 1,000 | 1 second | 1 millisecond
1 second | 1,000 | 16.7 minutes | 1 second
1 millisecond | 10 million | 2.8 hours | 1 millisecond
1 millisecond | 10 billion | 115.7 days | 1 millisecond
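As a check on the last row: at 1 millisecond per data point, one full-gradient step over 10 billion points takes 10^10 ms = 10^7 s ≈ 115.7 days, while one stochastic-gradient step touches only a single point and still takes about 1 millisecond.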
Machine Learning Specialization
Comparing gradient to
stochastic gradient
©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization23
Which one is better??? Depends…
Algorithm | Time per iteration | Total time to convergence for large data (in theory) | Total time to convergence for large data (in practice) | Sensitivity to parameters
Gradient | Slow for large data | Slower | Often slower | Moderate
Stochastic gradient | Always fast | Faster | Often faster | Very high
©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization25
Comparing gradient to stochastic gradient
©2015-2016 Emily Fox & Carlos Guestrin
(Plot: avg. log likelihood vs. number of passes over data, for gradient and stochastic gradient; higher is better)
Total time proportional to
number of passes over data
Stochastic gradient achieves higher
likelihood sooner, but it’s noisier
Machine Learning Specialization26
Eventually, gradient catches up
©2015-2016 Emily Fox & Carlos Guestrin
(Plot: avg. log likelihood over a longer run; gradient eventually catches up to stochastic gradient)
Note: should only trust “average” quality
of stochastic gradient (more discussion later)
Machine Learning Specialization27
Summary of stochastic gradient
Tiny change to gradient ascent
Much better scalability
Huge impact in real-world
Very tricky to get right in practice
©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization
Why would stochastic
gradient ever work???
©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization30
Gradient is direction of steepest ascent
©2015-2016 Emily Fox & Carlos Guestrin
Gradient is “best” direction, but
any direction that goes “up” would be useful
Machine Learning Specialization31
In ML, steepest direction is
sum of “little directions”
from each data point
©2015-2016 Emily Fox & Carlos Guestrin
\[
\frac{\partial \ell(\mathbf{w})}{\partial w_j} = \sum_{i=1}^{N} h_j(\mathbf{x}_i)\Big(\mathbf{1}[y_i = +1] - P(y = +1 \mid \mathbf{x}_i, \mathbf{w})\Big) = \sum_{i=1}^{N} \frac{\partial \ell_i(\mathbf{w})}{\partial w_j}
\]
Sum over data points.
For most data points,
contribution points “up”
Machine Learning Specialization32
Stochastic gradient: pick a
data point and move in direction
©2015-2016 Emily Fox & Carlos Guestrin
Most of the time,
total likelihood will increase
\[
\frac{\partial \ell(\mathbf{w})}{\partial w_j} \approx \frac{\partial \ell_i(\mathbf{w})}{\partial w_j}
\]
Machine Learning Specialization33
Stochastic gradient ascent:
Most iterations increase likelihood,
but sometimes decrease it ⇒
On average, make progress
©2015-2016 Emily Fox & Carlos Guestrin
until converged:
    for i = 1, …, N:
        for j = 0, …, D:
            w_j(t+1) ← w_j(t) + η ∂ℓ_i(w)/∂w_j
        t ← t + 1
Machine Learning Specialization
Convergence path
©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization37
Convergence paths
©2015-2016 Emily Fox & Carlos Guestrin
Gradient
Stochastic
gradient
Machine Learning Specialization38
Stochastic gradient convergence is “noisy”
©2015-2016 Emily Fox & Carlos Guestrin
(Plot: avg. log likelihood vs. iterations; higher is better)
Gradient usually increases
likelihood smoothly
Stochastic gradient
makes “noisy” progress
Machine Learning Specialization40
Summary of why stochastic gradient works
Gradient finds direction of
steepest ascent
Gradient is sum of contributions
from each data point
Stochastic gradient uses
direction from 1 data point
On average increases likelihood,
sometimes decreases
Stochastic gradient has “noisy”
convergence
©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization
Stochastic gradient:
practical tricks
©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization43
Stochastic gradient ascent
©2015-2016 Emily Fox & Carlos Guestrin
init w(1) = 0, t = 1
until converged:
    for i = 1, …, N:
        for j = 0, …, D:
            w_j(t+1) ← w_j(t) + η ∂ℓ_i(w)/∂w_j
        t ← t + 1
Machine Learning Specialization44
Order of data can introduce bias
©2015-2016 Emily Fox & Carlos Guestrin
Stochastic gradient updates parameters 1 data point at a time.
Systematic order in data can introduce significant bias, e.g., all negative points first, or temporal order, younger first, or …

x[1] = #awesome | x[2] = #awful | y = sentiment
0 | 2 | -1
3 | 3 | -1
2 | 4 | -1
0 | 3 | -1
0 | 1 | -1
2 | 1 | +1
4 | 1 | +1
1 | 1 | +1
2 | 1 | +1
Machine Learning Specialization45
Shuffle data before running
stochastic gradient!
©2015-2016 Emily Fox & Carlos Guestrin
Shuffle rows:

Before (sorted):
x[1] = #awesome | x[2] = #awful | y = sentiment
0 | 2 | -1
3 | 3 | -1
2 | 4 | -1
0 | 3 | -1
0 | 1 | -1
2 | 1 | +1
4 | 1 | +1
1 | 1 | +1
2 | 1 | +1

After shuffling:
x[1] = #awesome | x[2] = #awful | y = sentiment
1 | 1 | +1
3 | 3 | -1
0 | 2 | -1
4 | 1 | +1
2 | 1 | +1
2 | 4 | -1
0 | 1 | -1
0 | 3 | -1
2 | 1 | +1
Machine Learning Specialization47
Stochastic gradient ascent
©2015-2016 Emily Fox & Carlos Guestrin
Shuffle data
init w(1) = 0, t = 1
until converged:
    for i = 1, …, N:
        for j = 0, …, D:
            w_j(t+1) ← w_j(t) + η ∂ℓ_i(w)/∂w_j
        t ← t + 1

Before running stochastic gradient, make sure the data is shuffled.
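A minimal shuffling sketch, assuming H and y are NumPy arrays as in the earlier sketches; applying the same permutation to both keeps features and labels aligned:

```python
permutation = np.random.permutation(len(y))   # random order of the row indices
H, y = H[permutation], y[permutation]         # shuffle rows before running stochastic gradient
```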
Machine Learning Specialization
Choosing the step size η
©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization49
Picking step size
for stochastic gradient
is very similar to
picking step size
for gradient
©2015-2016 Emily Fox & Carlos Guestrin
But stochastic gradient
is a lot more unstable…
Machine Learning Specialization51
If step size is too small,
stochastic gradient slow to converge
©2015-2016 Emily Fox & Carlos Guestrin
(Plot: avg. log likelihood; higher is better)
Machine Learning Specialization52
If step size is too large,
stochastic gradient oscillates
©2015-2016 Emily Fox & Carlos Guestrin
(Plot: avg. log likelihood; higher is better)
Machine Learning Specialization53
If step size is very large,
stochastic gradient goes crazy
©2015-2016 Emily Fox & Carlos Guestrin
(Plot: avg. log likelihood; higher is better)
Machine Learning Specialization54
Simple rule of thumb for picking step size η
similar to gradient
•  Unfortunately, picking the step size requires a lot
of trial and error; it is much worse than for gradient
•  Try several values, exponentially spaced
- Goal: plot learning curves to
•  find one η that is too small
•  find one η that is too large
•  Advanced tip: step size that decreases with iterations is
very important for stochastic gradient, e.g.,
©2015-2016 Emily Fox & Carlos Guestrin
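The slide's example schedule is not reproduced here; purely as an illustration, one common decreasing schedule is η_t = η_0 / t, sketched below (eta_0 and the 1/t decay are assumptions, not the course's prescription):

```python
def step_size(t, eta_0=0.1):
    """Step size that shrinks as the iteration counter t (starting at 1) grows."""
    return eta_0 / t
```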
Machine Learning Specialization
Don’t trust the last coefficients…
©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization57
Stochastic gradient never fully “converges”
©2015-2016 Emily Fox & Carlos Guestrin
(Plot: avg. log likelihood; higher is better)
Gradient will
eventually
stabilize on
a solution
Stochastic gradient will eventually
oscillate around a solution
Machine Learning Specialization58
The last coefficients may be really good or really bad!!
©2015-2016 Emily Fox & Carlos Guestrin
w(1000) was bad	
w(1005) was good	
How do we
minimize the risk of
picking bad
coefficients?
Stochastic gradient will eventually
oscillate around a solution
Machine Learning Specialization59
Stochastic gradient returns average coefficients
•  Minimize noise:
don’t return last learned coefficients
•  Instead, output average:
©2015-2016 Emily Fox & Carlos Guestrin
\[
\hat{\mathbf{w}} = \frac{1}{T} \sum_{t=1}^{T} \mathbf{w}^{(t)}
\]
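A minimal sketch of the averaging rule, assuming the stochastic gradient loop appends every iterate w(t) to a list w_history:

```python
def averaged_coefficients(w_history):
    """(1/T) * sum_t w(t): average of the stored iterates, returned instead of the last one."""
    return np.mean(np.stack(w_history), axis=0)
```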
Machine Learning Specialization
Learning from batches of data
©2015-2016 Emily Fox & Carlos Guestrin
OPTIONAL
Machine Learning Specialization61
Gradient/stochastic gradient:
two extremes
©2015-2016 Emily Fox & Carlos Guestrin
Data
Stochastic gradient:
1 point/iteration
Gradient:
N points/iteration
Mini-batches:
B points/iteration
Reduces noise,
increases stability
Machine Learning Specialization63
Convergence paths
©2015-2016 Emily Fox & Carlos Guestrin
Stochastic gradient
Batch size = 1
Gradient
Stochastic gradient
Batch size = 25
Machine Learning Specialization64
Batch size effect
(Plot: avg. log likelihood for several batch sizes; higher is better)
Too large ⇒ slow convergence
Too small ⇒ noisy, may not converge
©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization66
Stochastic gradient ascent
with mini-batches
©2015-2016 Emily Fox & Carlos Guestrin
Shuffle data
init w(1) = 0, t = 1
until converged:
    for k = 0, …, N/B − 1:        (for each mini-batch)
        for j = 0, …, D:
            w_j(t+1) ← w_j(t) + η Σ_{i = 1 + k·B}^{(k+1)·B} ∂ℓ_i(w)/∂w_j    (sum over data points in mini-batch k)
        t ← t + 1
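A minimal sketch of the mini-batch variant, reusing predict_probability from the earlier sketch; the batch size B, step size, and pass count are illustrative assumptions:

```python
def minibatch_gradient_ascent(H, y, eta=0.1, B=25, n_passes=10):
    N, D = H.shape
    w = np.zeros(D)                                   # init w(1) = 0
    for _ in range(n_passes):                         # "until converged", approximated by a fixed pass count
        for k in range(N // B):                       # for each mini-batch k
            batch = slice(k * B, (k + 1) * B)         # points 1 + k*B, ..., (k+1)*B
            indicator = (y[batch] == +1).astype(float)
            errors = indicator - predict_probability(H[batch], w)
            w = w + eta * H[batch].T.dot(errors)      # sum of contributions over this mini-batch only
    return w
```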
Machine Learning Specialization
Measuring convergence
©2015-2016 Emily Fox & Carlos Guestrin
OPTIONAL
Machine Learning Specialization68
(Plot: avg. log likelihood; higher is better)
How did we make these plots???
Need to compute log likelihood of
data at every iteration???
©2015-2016 Emily Fox & Carlos Guestrin
⇒ Really really slow,
product over all data points!
\[
\ell(\mathbf{w}) = \ln \prod_{i=1}^{N} P(y_i \mid \mathbf{x}_i, \mathbf{w})
\]
Actually, plotting average
log likelihood over
last 30 mini-batches…
Machine Learning Specialization70
Computing log-likelihood during
run of stochastic gradient ascent
©2015-2016 Emily Fox & Carlos Guestrin
init w(1) = 0, t = 1
until converged:
    for i = 1, …, N:
        for j = 0, …, D:
            partial[j] = h_j(x_i) ( 1[y_i = +1] − P(y = +1 | x_i, w(t)) )
        w_j(t+1) ← w_j(t) + η partial[j]
        t ← t + 1

Log-likelihood of data point i is simply:
\[
\ell_i(\mathbf{w}^{(t)}) =
\begin{cases}
\ln P(y = +1 \mid \mathbf{x}_i, \mathbf{w}^{(t)}), & \text{if } y_i = +1 \\
\ln\big(1 - P(y = +1 \mid \mathbf{x}_i, \mathbf{w}^{(t)})\big), & \text{if } y_i = -1
\end{cases}
\]
Machine Learning Specialization72
Estimate log-likelihood with sliding window
©2015-2016 Emily Fox & Carlos Guestrin
(Plot: log-likelihood ℓ_i(w^{(t)}) of a single data point vs. iteration t; at t = 75, …)
Report average
of last k values
Inexpensive estimate of
log-likelihood;
useful to evaluate convergence
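A minimal sketch of the sliding-window estimate, reusing predict_probability from the earlier sketch; the window length k = 30 matches the plots, everything else is an assumption:

```python
from collections import deque

def point_log_likelihood(H, y, w, i):
    """ln P(y = +1 | x_i, w) if y_i = +1, else ln(1 - P(y = +1 | x_i, w))."""
    p = predict_probability(H[i:i+1], w)[0]
    return np.log(p) if y[i] == +1 else np.log(1.0 - p)

window = deque(maxlen=30)      # keeps only the last k = 30 values
# inside the stochastic gradient loop:
#     window.append(point_log_likelihood(H, y, w, i))
#     report np.mean(window) as the inexpensive convergence estimate
```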
Machine Learning Specialization73
That’s what average log-likelihood meant…
(In this case, over the last k = 30 mini-batches, with batch size B = 100)
©2015-2016 Emily Fox & Carlos Guestrin
(Plot: avg. log likelihood; higher is better)
Machine Learning Specialization
Adding regularization
©2015-2016 Emily Fox & Carlos Guestrin
OPTIONAL
Machine Learning Specialization75
Consider specific total cost
Total quality =
measure of fit - measure of magnitude
of coefficients
©2015 Emily Fox & Carlos Guestrin
measure of fit: ℓ(w);   measure of magnitude of coefficients: ‖w‖₂²
Machine Learning Specialization76
Gradient of L2 regularized log-likelihood
Total quality =
measure of fit - measure of magnitude
of coefficients
©2015 Emily Fox & Carlos Guestrin
Total quality = ℓ(w) − λ‖w‖₂²

Total derivative = ∂ℓ(w)/∂w_j − 2λ w_j, where
\[
\frac{\partial \ell(\mathbf{w})}{\partial w_j} = \sum_{i=1}^{N} h_j(\mathbf{x}_i)\Big(\mathbf{1}[y_i = +1] - P(y = +1 \mid \mathbf{x}_i, \mathbf{w})\Big)
\]
and the derivative of −λ‖w‖₂² with respect to w_j is −2λ w_j.
Machine Learning Specialization77
Stochastic gradient for regularized objective
•  What about regularization term?
©2015-2016 Emily Fox & Carlos Guestrin
Total derivative = ∂ℓ(w)/∂w_j − 2λ w_j, with
\[
\frac{\partial \ell(\mathbf{w})}{\partial w_j} = \sum_{i=1}^{N} \frac{\partial \ell_i(\mathbf{w})}{\partial w_j}
\]
Stochastic gradient ascent picks a different data point i each time, so
Total derivative ≈ ∂ℓ_i(w)/∂w_j − 2λ w_j / N
Each data point contributes 1/N of the regularization.
Machine Learning Specialization78
Stochastic gradient ascent with regularization
©2015-2016 Emily Fox & Carlos Guestrin
Shuffle data
init w(1) = 0, t = 1
until converged:
    for i = 1, …, N:
        for j = 0, …, D:
            w_j(t+1) ← w_j(t) + η ( ∂ℓ_i(w)/∂w_j − 2λ w_j / N )
        t ← t + 1
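A minimal sketch of one regularized stochastic update, reusing point_gradient from the earlier sketch; eta and lambda_ are illustrative assumptions:

```python
def regularized_sgd_step(H, y, w, i, eta=0.1, lambda_=1e-3):
    """w(t+1) = w(t) + eta * (dl_i/dw_j - 2*lambda*w_j/N): point i carries 1/N of the L2 penalty."""
    N = len(y)
    fit_term = point_gradient(H, y, w, i)       # contribution of data point i to the likelihood
    reg_term = -2.0 * lambda_ * w / N           # this point's 1/N share of the penalty's derivative
    return w + eta * (fit_term + reg_term)
```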
Machine Learning Specialization
Online learning:
Fitting models from
streaming data
©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization
Batch vs online learning
Batch learning
•  All data is available at
start of training time
Online learning
•  Data arrives (streams in) over time
-  Must train model as data arrives!
©2015-2016 Emily Fox & Carlos Guestrin
Batch: Data → ML algorithm → ŵ
Online: Data at t=1 → ML algorithm → ŵ(1); Data at t=2 → ŵ(2); Data at t=3 → ŵ(3); Data at t=4 → ŵ(4); …
Machine Learning Specialization83
Online learning example:
Ad targeting
©2015-2016 Emily Fox & Carlos Guestrin
Website page → input x_t (user info, page text) → ML algorithm with coefficients ŵ(t) → ŷ_t = suggested ads (Ad1, Ad2, Ad3) → user clicked on Ad2 ⇒ y_t = Ad2 → coefficients updated to ŵ(t+1)
Machine Learning Specialization84
Online learning problem
•  Data arrives at each time step t:
- Observe input xt
•  Info of user, text of webpage
- Make a prediction ŷt
•  Which ad to show
- Observe true output yt
•  Which ad user clicked on
©2015-2016 Emily Fox & Carlos Guestrin
Need ML algorithm to
update coefficients each time step!
Machine Learning Specialization85
Stochastic gradient ascent
can be used for online learning!!!
©2015-2016 Emily Fox & Carlos Guestrin
•  init w(1)=0, t=1
•  Each time step t:
- Observe input xt
- Make a prediction ŷt
- Observe true output yt
- Update coefficients:
  for j = 0, …, D:
      w_j(t+1) ← w_j(t) + η ∂ℓ_t(w)/∂w_j
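A minimal sketch of the same update driven by a data stream; `stream` is an assumed iterator yielding (h_x, y_t) pairs as they arrive, and the thresholded prediction is only illustrative:

```python
def online_logistic_regression(stream, D, eta=0.1):
    w = np.zeros(D)                                   # init w(1) = 0
    for h_x, y_t in stream:                           # each time step t
        p = 1.0 / (1.0 + np.exp(-h_x.dot(w)))         # observe input x_t, score it
        y_hat = +1 if p > 0.5 else -1                 # make a prediction ŷ_t (shown for completeness)
        indicator = 1.0 if y_t == +1 else 0.0         # observe true output y_t
        w = w + eta * h_x * (indicator - p)           # update coefficients immediately
    return w
```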
Machine Learning Specialization86
Summary of online learning
Data arrives over time
Must make a prediction every
time new data point arrives
Observe true class after
prediction made
Want to update parameters
immediately
©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization87
Updating coefficients immediately:
Pros and Cons
©2015-2016 Emily Fox & Carlos Guestrin
Pros
•  Model always up to date ⇒
Often more accurate
•  Lower computational cost
•  Don’t need to store all data, but often do anyway
Cons
•  Overall system is *much* more complex
-  Bad real-world cost in terms of $$$ to build & maintain
Most companies opt for systems that save data
and update coefficients every night, or hour, week,…
Machine Learning Specialization
Summary of scaling to
huge datasets & online learning
©2015-2016 Emily Fox & Carlos Guestrin
Machine Learning Specialization89
Scaling through parallelism
©2015-2016 Emily Fox & Carlos Guestrin
Multicore
processors
Computer
clusters
Huge opportunity for scaling
Require parallel & distributed
ML algorithms
Machine Learning Specialization90
What you can do now…
•  Significantly speedup learning
algorithm using stochastic gradient
•  Describe intuition behind why
stochastic gradient works
•  Apply stochastic gradient in practice
•  Describe online learning problems
•  Relate stochastic gradient to online
learning
©2015-2016 Emily Fox & Carlos Guestrin
