2. Agenda
Part 1: What is “machine learning”?
Part 2: Finding patterns in an actual data set
3. “Machine Learning”: Finding patterns in data
• Famous “Iris” data set has measurements for 150 flowers
• Given a flower’s measurements, can we predict its species?
[Photos: Iris setosa, Iris versicolor, Iris virginica]
4. [Scatter plot of Petal Width (cm) and Petal Length (cm): Iris setosa as red dots, Iris versicolor as green dots, Iris virginica as blue dots]
5. [Scatter plot of Petal Width (cm) and Petal Length (cm)]
Congratulations! You just trained a model.
7. [Scatter plot of Petal Width (cm) and Petal Length (cm), with four new flowers labeled "Prediction: Iris virginica", "Prediction: Iris versicolor", "Prediction: Iris setosa" and "Prediction: Iris virginica"]
Congratulations! You just scored four previously unseen flowers using your
model, and made a prediction about the species of each one.
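A minimal sketch of what "training" and "scoring" could look like in code. It assumes a nearest-centroid model, which is one simple choice and not necessarily what the slides used; the measurements are a few hand-typed illustrative rows, not the full 150-flower data set, and the names `train` and `score` are made up for this example.

```python
from collections import defaultdict
from math import dist

# Hand-typed illustrative (petal_length_cm, petal_width_cm, species) rows.
# An assumption for this sketch, not the full Iris data set.
TRAINING_ROWS = [
    (1.4, 0.2, "Iris setosa"),
    (1.3, 0.2, "Iris setosa"),
    (1.5, 0.3, "Iris setosa"),
    (4.7, 1.4, "Iris versicolor"),
    (4.5, 1.5, "Iris versicolor"),
    (4.0, 1.3, "Iris versicolor"),
    (6.0, 2.5, "Iris virginica"),
    (5.1, 1.9, "Iris virginica"),
    (5.9, 2.1, "Iris virginica"),
]


def train(rows):
    """Training: average each species' measurements into one centroid."""
    grouped = defaultdict(list)
    for length, width, species in rows:
        grouped[species].append((length, width))
    return {
        species: (
            sum(p[0] for p in points) / len(points),
            sum(p[1] for p in points) / len(points),
        )
        for species, points in grouped.items()
    }


def score(model, length, width):
    """Scoring: predict the species whose centroid is closest."""
    return min(model, key=lambda species: dist(model[species], (length, width)))


model = train(TRAINING_ROWS)

# Four previously unseen flowers (hypothetical measurements).
new_flowers = [(5.6, 2.2), (4.2, 1.3), (1.6, 0.2), (5.8, 1.8)]
predictions = [score(model, length, width) for length, width in new_flowers]
print(predictions)
```

Training looks at every labeled row once; scoring only compares a new flower against three stored points.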
8. • Data is just a table of values
• Each row is an “instance”, an
example of the concept to be
learned
• Each column is an “attribute” or
“feature” of the instance
• The column we want to predict is
the “label”
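The vocabulary above maps directly onto a small data structure. This is a sketch only: the column names and values are illustrative assumptions, typed by hand.

```python
# A tiny table: each dict is one row, i.e. one "instance".
# Values are illustrative, typed by hand for this sketch.
table = [
    {"petal_length_cm": 1.4, "petal_width_cm": 0.2, "species": "Iris setosa"},
    {"petal_length_cm": 4.7, "petal_width_cm": 1.4, "species": "Iris versicolor"},
    {"petal_length_cm": 6.0, "petal_width_cm": 2.5, "species": "Iris virginica"},
]

LABEL_COLUMN = "species"  # the column we want to predict

# Every other column is an "attribute" or "feature" of the instance.
feature_names = [name for name in table[0] if name != LABEL_COLUMN]

# Split each instance into its feature values and its label.
features = [[row[name] for name in feature_names] for row in table]
labels = [row[LABEL_COLUMN] for row in table]

print(feature_names)
print(features)
print(labels)
```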
12. Training versus Scoring
• This process had two steps: training and scoring
• When training on historical data, you’re often looking for patterns
that emerge over weeks, months or even years
• When scoring new data points, you want the answer immediately
(in “real time”)
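The two steps can be kept apart in code as well. The sketch below assumes a deliberately tiny model (the average value seen per label) and uses a JSON string as a stand-in for saving the model to disk; the function names and the sample rows are hypothetical.

```python
import json


def train(history):
    """Offline step: scan all historical (value, label) rows once."""
    totals = {}
    for value, label in history:
        running_sum, count = totals.get(label, (0.0, 0))
        totals[label] = (running_sum + value, count + 1)
    # The "model" is just the average value seen for each label.
    return {label: s / n for label, (s, n) in totals.items()}


def score(model, value):
    """Online step: one cheap comparison per new data point."""
    return min(model, key=lambda label: abs(model[label] - value))


# Training runs rarely (say, nightly) over the accumulated history.
history = [
    (1.4, "Iris setosa"),
    (1.5, "Iris setosa"),
    (4.7, "Iris versicolor"),
    (4.5, "Iris versicolor"),
]
saved_model = json.dumps(train(history))  # stand-in for writing to disk

# Scoring runs per request and needs only the saved model, not the history.
model = json.loads(saved_model)
prediction = score(model, 4.1)
print(prediction)
```

The point of the split is that the scoring path never touches the historical data, so it can answer immediately even when training takes a long time.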
13. Do you really need to train in “real time”?
• Many real-world cases rely heavily on historical data
• Credit scores, fraud detection, movie ratings, web search relevance, disease
diagnosis, customer churn, yield on a silicon wafer …
• Extreme example: text recognition!
• You might add fresh training data daily or hourly, but you will still
have lots of historical data in the training set.
• You definitely want to score in real time, because you’re typically
using this model in some sort of app
15. What “Real Time” Really Means
• The next time you hear someone talk about “real time”
machine learning, make yourself look really smart and ask if
they mean training or scoring
16. “What do you mean, real time training or real time scoring?”
“What? I don’t …”
17. The StumbleUpon Dataset
• StumbleUpon is an app that recommends web pages
• Dataset of 7,400 web pages is provided, with each page labeled as
either “evergreen” or “ephemeral”
• We want to predict the page’s class using this historical data
While some pages we recommend, such as news
articles or seasonal recipes, are only relevant for a
short period of time, others maintain a timeless
quality and can be recommended to users long after
they are discovered. In other words, pages can
either be classified as "ephemeral" or "evergreen".
18. Training a model on StumbleUpon data
• Live demo: training a model on StumbleUpon data
• Key concepts:
• “Bag of words” text analysis
• Evaluating the model using a holdout set
• Combining multiple models to improve accuracy
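For readers who missed the live demo, here is a toy sketch of the three key concepts. It is not the demo itself: the page snippets are invented for illustration (they are not from the StumbleUpon data), the word-overlap model is a deliberately crude stand-in for a real classifier, and all function names are hypothetical.

```python
from collections import Counter


def bag_of_words(text):
    """Bag of words: represent text as word counts, ignoring word order."""
    return Counter(text.lower().split())


def train(rows):
    """Collect the words seen under each label."""
    vocab = {}
    for text, label in rows:
        vocab.setdefault(label, Counter()).update(bag_of_words(text))
    return vocab


def predict(model, text):
    """Predict the label whose training vocabulary overlaps the text most."""
    bag = bag_of_words(text)

    def overlap(label):
        return sum(n for word, n in bag.items() if word in model[label])

    return max(model, key=overlap)


def majority_vote(models, text):
    """Combine several models: the most common prediction wins."""
    votes = Counter(predict(m, text) for m in models)
    return votes.most_common(1)[0][0]


# Invented snippets for illustration only.
pages = [
    ("classic bread recipe with flour and yeast", "evergreen"),
    ("how to tie a tie step by step", "evergreen"),
    ("guide to growing tomatoes at home", "evergreen"),
    ("election results announced tonight", "ephemeral"),
    ("breaking news storm hits coast today", "ephemeral"),
    ("tonight game score and highlights", "ephemeral"),
    ("easy recipe guide for homemade pasta", "evergreen"),
    ("breaking news results from today", "ephemeral"),
]

# Holdout set: the last two pages are never shown to the models in training.
train_rows, holdout_rows = pages[:6], pages[6:]

# Three models, each trained with a different pair of rows left out.
models = [
    train([row for i, row in enumerate(train_rows) if i % 3 != k])
    for k in range(3)
]

correct = sum(majority_vote(models, text) == label for text, label in holdout_rows)
holdout_accuracy = correct / len(holdout_rows)
print(holdout_accuracy)
```

Measuring accuracy only on the holdout rows gives an estimate of how the combined model behaves on pages it has not seen, which is what matters when scoring new data.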
19. Final Thought
• The two datasets we trained on were not “big”
• Iris dataset: 150 rows, less than 5 KB
• StumbleUpon dataset: 7,400 rows, 21 MB
• Data doesn’t need to be big to be useful