2nd edition
#MLSEV 2
Supervised vs. Unsupervised
The Lay of the Land
Charles Parker
VP Algorithms, BigML, Inc
#MLSEV 3
Machine Learning Landscape
It’s not just you - this stuff can be hard
• Do I have data on which I can do
machine learning?
• Do I have a problem to which I can apply
machine learning?
• Should I apply machine learning to that
problem?
• What if my problem doesn’t match a
traditional machine learning problem?
#MLSEV 4
ML-Ready Data
#MLSEV 5
Getting Your Data In Order
• Data takes many shapes and sizes
• Databases
• Collections of multimedia files
• Log files
• The largest class of ML algorithms
generally expects your data in
tabular form
• If it isn’t in that form, you’ve got to
get it there
#MLSEV 6
Rows: What Do You Want to Know About?
• Each row is a thing that you want to have
more information about
• Churn prediction: Each row is a customer
• Medical diagnosis: Each row is a patient
• Credit card fraud: Each row is a
transaction
• Market closing price prediction: Each row
is a day
#MLSEV 7
Columns: What Information Do You Have?
• Each column is a piece of information you can get about
the thing represented by the row
• Churn prediction: Last month’s bill, number of times
support was called, whether or not the customer churned
• Medical diagnosis: Body temperature, BMI, whether or not
the patient has a disease
• Credit card fraud: Transaction geolocation, the address of
the card holder, the amount
• Market closing price prediction: Opening price, volume
that day, day of week
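To make the row/column picture concrete, here is a tiny invented churn table built with pandas; the column names and values are purely illustrative, not from any real dataset:

```python
import pandas as pd

# One row per customer (the thing we want to know about),
# one column per piece of information we have about them.
churn = pd.DataFrame({
    "last_month_bill": [79.0, 120.5, 35.0],
    "support_calls":   [0, 4, 1],
    "churned":         ["no", "yes", "no"],
})
print(churn)
```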
#MLSEV 8
Feature Engineering
• You should in general try to reduce complex data to features that are either
numeric or categorical (i.e., a variable with a finite set of possible values)
• Aside: A good categorical feature should have no values that occur only once, and the set
of possible values should be small (fewer than 10 is good, fewer than 100 is maybe okay)
• Text data can be reduced to counts of informative words
• Strings representing a date can be reduced to the parts of the date (month, year,
day of month, day of week)
• Sometimes, you must do this yourself, but in some common cases it can be
automated (BigML does this for text and date-time data); a sketch of the idea follows
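A minimal sketch of the two reductions above, using pandas and scikit-learn (the sample strings are invented; BigML performs equivalent transformations automatically):

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

# Date strings -> date parts (year, month, day of month, day of week).
dates = pd.to_datetime(pd.Series(["2020-03-26", "2019-11-02"]))
date_parts = pd.DataFrame({
    "year":         dates.dt.year,
    "month":        dates.dt.month,
    "day_of_month": dates.dt.day,
    "day_of_week":  dates.dt.dayofweek,
})

# Free text -> counts of (hopefully informative) words.
docs = ["the bill was too high", "great support, great product"]
word_counts = CountVectorizer(stop_words="english").fit_transform(docs)
```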
#MLSEV 9
Special Case: Aggregation
• If the rows in your data do not match the
thing you want to know about, you need
to do some sort of data transformation
• Problem: You have a table of
transactions, but you want to do
customer segmentation
• Solution: Create features that are per-
customer aggregations (e.g., total
number of transactions, average
purchase size, etc.)
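A minimal sketch of the per-customer aggregation with pandas, assuming a hypothetical transaction table with `customer_id` and `amount` columns:

```python
import pandas as pd

# Invented transaction log: one row per transaction, not per customer.
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "amount":      [10.0, 25.0, 5.0, 7.5, 12.0],
})

# Roll transactions up to one row per customer before segmenting.
per_customer = tx.groupby("customer_id")["amount"].agg(
    n_transactions="count",
    avg_purchase="mean",
    total_spent="sum",
).reset_index()
print(per_customer)
```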
#MLSEV 10
Supervised Learning
#MLSEV 11
Learning From Data
• In supervised learning, one of those columns is special and
is variously called the “objective”, “target variable” or
“label”.
• This is something we know when we’re training, but don’t
know at prediction time (but wish we did)
• Supervised machine learning creates a program to
predict that value from the other values in the training
data
• Said another way, it creates a program that transforms
the things you know into the things you want to know
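In code, the learned “program” is just a fitted model. Here is a minimal scikit-learn sketch on the invented churn columns from earlier (a decision tree stands in for whatever algorithm you would actually choose):

```python
from sklearn.tree import DecisionTreeClassifier

# Columns we know (last month's bill, support calls) and the one we
# only know at training time (churned?).
X = [[79.0, 0], [120.5, 4], [35.0, 1]]
y = ["no", "yes", "no"]

model = DecisionTreeClassifier().fit(X, y)   # create the "program"
print(model.predict([[150.0, 3]]))           # run it where the answer is unknown
```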
#MLSEV 12
The Objective Field
• The column in your data that you don’t
know in advance, but wish you did
• Churn prediction: Whether or not the
customer churned
• Medical diagnosis: Whether or not the
patient has a disease
• Credit card fraud: Whether or not the
transaction was fraudulent
• Market closing price prediction: The
closing price of the market that day
#MLSEV 13
So Many Algorithms!
• All supervised learning algorithms are
doing the same thing
• So why are there so many of them?
• Different algorithms make different assumptions
about the function they’re trying to fit
• Different algorithms have very different
performance characteristics
• The “right” algorithm depends on the
problem you’re trying to solve and the
data that you’re using to solve it
#MLSEV 14
A Simple Algorithmic Ontology
• Amount of data required: linear models < trees, ensembles < deep learning
• Potential to overfit: linear models < ensembles < trees, deep learning
• Speed: linear models, trees < ensembles < deep learning
• Representational power: linear models < trees < ensembles < deep learning
• How much data do you have?
• How fast do you need things to go?
• How much performance do you really need?
#MLSEV 15
The Triple Tradeoff
[Diagram: the triple tradeoff, linking prediction error, training data size, and algorithmic power]
#MLSEV 16
Unsupervised Learning
#MLSEV 17
I Have Nothing To Predict!
• What if there is no objective column? Is all
lost?
• Which segment does this customer fit into?
• What is this collection of documents about?
• What are some strong correlations in this dataset?
• Find me some points that are odd in this data
• This is unsupervised learning
• Unsupervised learning creates a structure
that explains all or part of the data
#MLSEV 18
Supervised Learning
Predict customer churn from the rest of the
features, like calls to support and last
month’s bill
• We have a bunch of columns we
know, both now and when we
make a prediction
• We have one column that we know
now, but would like to know
without having to acquire the
answer again
• Use the former to predict the latter
#MLSEV 19
Clustering
The best way to break these customers up
into three groups is group 1, with one
customer; group 2, with three customers;
and group 3, with two customers
• We have a bunch of columns we
know, but nothing to predict
• We'd like to see the groups this
data “naturally” falls into
• Applications: Customer
Segmentation, Recommendation
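A minimal clustering sketch with scikit-learn’s k-means, on invented per-customer features; note that there is no objective column anywhere:

```python
from sklearn.cluster import KMeans

# Invented per-customer features, e.g. (transactions, average purchase).
X = [[2, 17.5], [1, 300.0], [3, 8.2], [2, 15.0], [1, 280.0], [4, 9.9]]

# Ask for three "natural" groups; nothing is being predicted.
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels)  # group assignment for each customer
```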
#MLSEV 20
Association Rules
If last month’s bill was greater than $200
and the user called support more than twice,
then the customer usually churns
• We have a bunch of columns
• We don't have a specific prediction
in mind, but we’d like to see simple
rules where one thing predicts
another
• Applications: Market basket
analysis, data exploration, simple
modeling
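A minimal sketch using the mlxtend library (one common open-source implementation of association rule mining); the one-hot table of conditions below is invented:

```python
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules

# Rows are customers, columns are true/false conditions.
data = pd.DataFrame({
    "high_bill":  [True, True, False, True],
    "many_calls": [True, True, False, False],
    "churned":    [True, True, False, False],
})

frequent = apriori(data, min_support=0.3, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=0.8)
print(rules[["antecedents", "consequents", "confidence"]])
```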
#MLSEV 21
Topic Modeling
Create a model of the topics that best
explain these text fields
• We have text data
• We’d like to know what this text
data is “about”, in terms of groups
of words that tend to occur
together
• Applications: Document discovery,
preprocessing for classification
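A minimal topic-modeling sketch with scikit-learn’s latent Dirichlet allocation, one standard algorithm for this task (the documents are invented):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["the bill arrived late", "support fixed my bill",
        "great product quality", "the product shipped fast"]

counts = CountVectorizer(stop_words="english").fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

# Each row of components_ is a topic: a weighting over words that
# tend to occur together.
print(lda.components_.shape)  # (2 topics, vocabulary size)
```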
#MLSEV 22
Anomaly Detection
This combination of feature values is unusual
amongst all combinations of values in the
dataset
• We have a bunch of rows
• We know most of them are the
same in some way (they are the
“usual case”)
• But a very few are not normal
• We'd like to find these very few
• Applications: Fraud detection, data
cleaning
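A minimal sketch with scikit-learn’s isolation forest, one common anomaly detector; the rows are invented, with one obvious outlier:

```python
from sklearn.ensemble import IsolationForest

# Mostly "usual" rows, plus one that is not like the others.
X = [[1.0, 1.1], [0.9, 1.0], [1.1, 0.9], [1.0, 0.95], [8.0, -3.0]]

detector = IsolationForest(random_state=0).fit(X)
print(detector.predict(X))  # -1 flags the unusual row, 1 the usual ones
```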
#MLSEV 23
Hold On A Second!
#MLSEV 24
I’m Sure You’re All Very Excited
• You’ve got data and a problem that ML
can solve. Great!
• Now, should you use ML to solve that
problem?
• What are some useful ways to think about
that question?
#MLSEV 25
Expert System: Expert And Programmer
• Historical computerized expert systems
are based on the knowledge of two
people
• The expert is the person with the
domain knowledge and experience
• The programmer interacts with the
expert and creates a computer program
based on their knowledge
• They may be the same person, but you
need both
#MLSEV 26
Machine Learning: Data and Algorithm
• In machine learning, data replaces the expert
and algorithms replace programmers
• Data is often more reliable and sometimes
easier to get than human expertise
• Algorithms work faster, generate more
complex programs, and are more modular
than human programmers
• Machine learning is a good idea when
you can leverage these advantages
#MLSEV 27
#1: People Can’t Tell You How They Do It
• Cases where everyone can do this thing, but
it’s hard for them to explain how they do it
• Many computer vision tasks
• Speech recognition
• Lots of NLP problems (e.g., document
classification)
• Many spatial navigation problems
• Bonus if many people have to do this thing
#MLSEV 28
#2: Human Experts are Expensive
• Cases where it’s tough to get your
hands on an expert, or their
knowledge is too deep to be readily
programmed
• Medical diagnosis
• Game-playing at high levels
• Autonomous helicopter piloting
#MLSEV 29
#3: Everyone Gets Their Own Algorithm
• Cases where a specific model in thousands
of locations would be better than one big
system, and each location is generating the
data necessary to create one
• Location prediction (via mobile)
• Spam detection (from content)
• Demand prediction
#MLSEV 30
#4: Every Little Bit Counts
• Cases where performance is the
overarching concern, even at very small
increments
• Market trading, financial modeling
• High volume vision tasks where mistakes
are costly
• Some product recommendation problems
#MLSEV 31
Some Negative Examples
• Human experts are cheap and easy to come
by (lots of examples in NLP and vision)
• Performance of humans is better (though it
may be slower and more expensive)
• A competent program can easily be written
by hand
• The data is difficult to acquire and/or to label
#MLSEV 32
The Great Beyond
#MLSEV 33
Am I In The Right Room?
• There are lots of problems solvable via ML
that don’t fall exactly into any of these
buckets
• Machine translation
• Image Segmentation
• Game-playing
• “Matching” problems
• Let’s talk about a few of these other
buckets now
#MLSEV 34
Label Sequence Learning
Part-of-speech tagging, OCR, multimedia annotation
We (pronoun) played (verb) outside (adverb) yesterday (adverb)
• Predictions come in a sequence
• Correct value may be “context dependent”
#MLSEV 35
Metric Learning
Document matching, query processing, recommendation
• Learn an embedding for the data
• In the new space, things that are related
should be close together, and unrelated
things should be far apart
• The objective isn’t usually a column, but a
list of pairs of things that are “related” and
“unrelated”
#MLSEV 36
Reinforcement Learning
Game playing, planning, control systems
• Predictions are sequential and when you make one
(take an action), it influences the next prediction you
have to make (next state)
• You may or may not get a reward when you take an
action in a certain state
• Taking a certain action in a certain state might not
always result in the same next state or reward
• The action space is often infinite, structured, and/or
conditional on the state
• You learn from a simulator instead of a dataset
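To make the state/action/reward loop concrete, here is a minimal tabular Q-learning sketch on an invented two-state, two-action toy problem (real applications learn against a simulator):

```python
import random

# Invented toy problem: 2 states, 2 actions, deterministic dynamics.
reward     = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.0, (1, 1): 2.0}
next_state = {(0, 0): 0,   (0, 1): 1,   (1, 0): 0,   (1, 1): 1}

Q = {(s, a): 0.0 for s in (0, 1) for a in (0, 1)}
alpha, gamma, state = 0.1, 0.9, 0

for _ in range(1000):
    action = random.choice((0, 1))  # pure exploration, for simplicity
    r, s2 = reward[(state, action)], next_state[(state, action)]
    best_next = max(Q[(s2, a)] for a in (0, 1))
    # Move Q(state, action) toward reward plus discounted future value.
    Q[(state, action)] += alpha * (r + gamma * best_next - Q[(state, action)])
    state = s2

print(Q)  # action 1 should look best in both states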
#MLSEV 37
A Solution Might Be At Hand
• All of these problems have their own
algorithms, even their own niches in
the academic literature
• Sometimes, solutions can be
assembled by using standard
algorithms as parts of a solution
• See, for example, sliding window
classifiers
• . . . and WhizzML!
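As one illustration of assembling a solution from standard parts, a sliding-window tagger turns sequence labeling into ordinary classification by using each word’s neighbors as features; a minimal sketch on the invented tagged sentence from the earlier slide:

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def windows(words):
    """One feature dict per word: the word plus its left/right neighbors."""
    padded = ["<s>"] + words + ["</s>"]
    return [{"prev": padded[i - 1], "word": padded[i], "next": padded[i + 1]}
            for i in range(1, len(padded) - 1)]

sentence = ["We", "played", "outside", "yesterday"]
tags = ["pronoun", "verb", "adverb", "adverb"]

# An ordinary classifier, applied per-position over window features.
tagger = make_pipeline(DictVectorizer(), LogisticRegression())
tagger.fit(windows(sentence), tags)  # toy-sized; a real tagger needs a corpus
print(tagger.predict(windows(sentence)))
```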
#MLSEV 38
Summary
• The quickest way to machine learning
is to get tabular data
• If you know what you want to predict,
it’s supervised learning; if you don’t,
it’s unsupervised learning
• Machine learning will only work if you
can leverage the advantages of the
data + algorithm paradigm