Hands-on Machine Learning 
David Gerster 
VP Data Science
Agenda 
Part 1: What is “machine learning”? 
Part 2: Finding patterns in an actual data set 
“Machine Learning”: Finding patterns in data 
• Famous “Iris” data set has measurements for 150 flowers 
• Given a flower’s measurements, can we predict its species? 
[Images: Iris setosa, Iris versicolor, Iris virginica]
[Scatter plot with axes Petal Width (cm) and Petal Length (cm): Iris setosa as red dots, Iris versicolor as green dots, Iris virginica as blue dots]
[Scatter plot with axes Petal Width (cm) and Petal Length (cm)]
Congratulations! You just trained a model.
[Scatter plot with axes Petal Width (cm) and Petal Length (cm), labeled "Prediction: Iris virginica", "Prediction: Iris versicolor", "Prediction: Iris setosa", "Prediction: Iris virginica"]
[Scatter plot with axes Petal Width (cm) and Petal Length (cm), labeled "Prediction: Iris virginica", "Prediction: Iris versicolor", "Prediction: Iris setosa", "Prediction: Iris virginica"]
Congratulations! You just scored four previously unseen flowers using your
model, and made a prediction about the species of each one.
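The train-then-score idea from the scatter plots can be sketched in a few lines of Python. This is only an illustration, not the method used on the slides: it swaps the "draw boundaries by eye" step for a nearest-average rule (predict the species whose average petal measurements are closest). The nine training rows are taken from the Iris data set; the four "unseen" flowers and the function names `train` and `score` are invented for this sketch.

```python
from math import dist
from statistics import mean

# A handful of (petal length cm, petal width cm, species) rows from the Iris data set.
TRAINING_ROWS = [
    (1.4, 0.2, "Iris setosa"),
    (1.4, 0.2, "Iris setosa"),
    (1.3, 0.2, "Iris setosa"),
    (4.7, 1.4, "Iris versicolor"),
    (4.5, 1.5, "Iris versicolor"),
    (4.9, 1.5, "Iris versicolor"),
    (6.0, 2.5, "Iris virginica"),
    (5.1, 1.9, "Iris virginica"),
    (5.9, 2.1, "Iris virginica"),
]


def train(rows):
    """Training: compute the average (length, width) point for each species."""
    by_species = {}
    for length, width, species in rows:
        by_species.setdefault(species, []).append((length, width))
    return {
        species: (mean(p[0] for p in points), mean(p[1] for p in points))
        for species, points in by_species.items()
    }


def score(model, length, width):
    """Scoring: predict the species whose average point is closest."""
    return min(model, key=lambda species: dist(model[species], (length, width)))


model = train(TRAINING_ROWS)

# Four hypothetical, previously unseen flowers (measurements invented for illustration).
for length, width in [(5.8, 2.2), (4.4, 1.3), (1.5, 0.3), (6.1, 2.3)]:
    print(f"length={length}, width={width} -> Prediction: {score(model, length, width)}")
```

With so few training rows the rule is crude, and flowers near the versicolor/virginica border can be misclassified; the point is only that training produces a small model and scoring is a cheap lookup against it.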
• Data is just a table of values 
• Each row is an “instance”, an 
example of the concept to be 
learned 
• Each column is an “attribute” or 
“feature” of the instance 
• The column we want to predict is 
the “label” 
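As a rough illustration of this vocabulary, the sketch below holds three rows of the Iris data set as a plain table and separates the features from the label. The column names and the helper `split_features_and_label` are naming choices made for this example, not something defined in the slides.

```python
# Column names are this sketch's own choice; the last column is the label.
COLUMNS = ["sepal_length_cm", "sepal_width_cm", "petal_length_cm", "petal_width_cm", "species"]
LABEL = "species"

# Three instances (rows) from the Iris data set, one per species.
ROWS = [
    [5.1, 3.5, 1.4, 0.2, "Iris setosa"],
    [7.0, 3.2, 4.7, 1.4, "Iris versicolor"],
    [6.3, 3.3, 6.0, 2.5, "Iris virginica"],
]


def split_features_and_label(row, columns=COLUMNS, label=LABEL):
    """Return (features, label_value) for one instance."""
    record = dict(zip(columns, row))
    label_value = record.pop(label)  # the column we want to predict
    return record, label_value


for row in ROWS:
    features, label_value = split_features_and_label(row)
    print(features, "->", label_value)
```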
Try out the Iris data set at
That was easy! … So What? 
Training versus Scoring 
• This process had two steps: training and scoring 
• When training on historical data, you’re often looking for patterns 
that emerge over weeks, months or even years 
• When scoring new data points, you want the answer immediately 
(in “real time”) 
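One way to picture the split between the two steps is the sketch below, which assumes a deliberately simple model (one average point per label). Training runs once over historical rows and its result is stored as JSON; scoring later loads that stored model and answers a single request. The function names, the JSON storage format, and the scored measurements are all assumptions of this sketch; the three history rows come from the Iris data set.

```python
import json
import time


def train(history):
    """Offline step: summarise historical rows into one average point per label."""
    sums = {}
    for length, width, label in history:
        total = sums.setdefault(label, [0.0, 0.0, 0])
        total[0] += length
        total[1] += width
        total[2] += 1
    return {label: [s_len / n, s_wid / n] for label, (s_len, s_wid, n) in sums.items()}


def score(model, length, width):
    """Online step: return the label whose average point is closest."""
    def squared_distance(label):
        avg_length, avg_width = model[label]
        return (avg_length - length) ** 2 + (avg_width - width) ** 2
    return min(model, key=squared_distance)


# Historical data: (petal length cm, petal width cm, species).
HISTORY = [
    (1.4, 0.2, "Iris setosa"),
    (4.7, 1.4, "Iris versicolor"),
    (6.0, 2.5, "Iris virginica"),
]

# Training happens ahead of time; only its output is kept.
saved_model = json.dumps(train(HISTORY))

# Later, inside an app: load the stored model and score one new data point.
model = json.loads(saved_model)
start = time.perf_counter()
prediction = score(model, 5.8, 2.2)  # hypothetical new flower
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"{prediction} (scored in {elapsed_ms:.3f} ms)")
```

However long training takes, the app only pays for the call to `score`.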
Do you really need to train in “real time”? 
• Many real-world cases rely heavily on historical data 
• Credit scores, fraud detection, movie ratings, web search relevance, disease 
diagnosis, customer churn, yield on a silicon wafer … 
• Extreme example: text recognition! 
• You might add fresh training data daily or hourly, but you will still 
have lots of historical data in the training set. 
• You definitely want to score in real time, because you’re typically 
using this model in some sort of app 
What “Real Time” Really Means 
• The next time you hear someone talk about “real time” 
machine learning, make yourself look really smart and ask if 
they mean training or scoring 
“What do you mean, real time training or real time scoring?”
“What? I don’t …”
The StumbleUpon Dataset 
• StumbleUpon is an app that recommends web pages 
• Dataset of 7,400 web pages is provided, with each page labeled as 
either “evergreen” or “ephemeral” 
• We want to predict the page’s class using this historical data 
While some pages we recommend, such as news 
articles or seasonal recipes, are only relevant for a 
short period of time, others maintain a timeless 
quality and can be recommended to users long after 
they are discovered. In other words, pages can 
either be classified as "ephemeral" or "evergreen".
Training a model on StumbleUpon data 
• Live demo: training a model on StumbleUpon data 
• Key concepts: 
• “Bag of words” text analysis 
• Evaluating the model using a holdout set 
• Combining multiple models to improve accuracy 
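The three key concepts can be sketched together in plain Python. This is not the code from the live demo: the eight pages below are invented for illustration (they are not rows from the StumbleUpon dataset), the word-overlap classifier is a stand-in chosen for brevity, and all function names are hypothetical.

```python
import random
from collections import Counter

# Hypothetical labeled pages, invented for this sketch.
PAGES = [
    ("classic bread recipe with flour and yeast", "evergreen"),
    ("how to tie a tie step by step", "evergreen"),
    ("guide to growing tomatoes at home", "evergreen"),
    ("simple recipe for tomato soup", "evergreen"),
    ("election results announced tonight", "ephemeral"),
    ("stock market falls on news today", "ephemeral"),
    ("weather forecast for this weekend", "ephemeral"),
    ("breaking news from the game tonight", "ephemeral"),
]


def bag_of_words(text):
    """Bag of words: represent a page as word counts, ignoring word order."""
    return Counter(text.lower().split())


def holdout_split(rows, holdout_fraction=0.25, seed=0):
    """Set aside a fraction of the rows that the model never sees in training."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]


def train_word_model(rows):
    """Sum the bags of words of all training pages, per label."""
    totals = {}
    for text, label in rows:
        totals.setdefault(label, Counter()).update(bag_of_words(text))
    return totals


def predict(totals, text):
    """Pick the label whose training vocabulary overlaps most with the page."""
    bag = bag_of_words(text)

    def overlap(label):
        return sum(count * totals[label][word] for word, count in bag.items())

    return max(sorted(totals), key=overlap)


def majority_vote(predictions):
    """Combine several models by taking the most common prediction."""
    return Counter(predictions).most_common(1)[0][0]


training_rows, holdout_rows = holdout_split(PAGES)

# Three models, each trained on a different random resample of the training rows.
rng = random.Random(1)
models = [
    train_word_model([rng.choice(training_rows) for _ in training_rows])
    for _ in range(3)
]

# Evaluate the combined model only on the holdout rows.
correct = sum(
    majority_vote([predict(m, text) for m in models]) == label
    for text, label in holdout_rows
)
accuracy = correct / len(holdout_rows)
print(f"holdout accuracy: {accuracy:.2f} on {len(holdout_rows)} pages")
```

With only eight toy pages the accuracy figure means little; the sketch is meant to show where each concept sits in the workflow, not to estimate real performance.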
Final Thought 
• The two datasets we trained on were not “big” 
• Iris dataset: 150 rows, less than 5 KB 
• StumbleUpon dataset: 7,400 rows, 21 MB 
• Data doesn’t need to be big to be useful 