A mini-talk prepared for Papers We Love @ Seattle, September 2016, about the paper "A Few Useful Things to Know About Machine Learning". At just over eight pages, this paper by Pedro Domingos delivers an approachable summary of some of the folk knowledge often missed by those new to the field of Machine Learning.
The talk stays at a high level, avoiding the math-heavy approach typical of Machine Learning discussions. The slides include references to other resources that may be useful to students and practitioners alike.
2. About The Paper
A Few Useful Things to Know about Machine Learning
Pedro Domingos
Department of Computer Science and Engineering
University of Washington
Seattle, WA 98195-2350, U.S.A.
pedrod@cs.washington.edu
ABSTRACT
Machine learning algorithms can figure out how to perform important tasks by generalizing from examples. This is often feasible and cost-effective where manual programming is not. As more data becomes available, more ambitious problems can be tackled. As a result, machine learning is widely used in computer science and other fields. However, developing successful machine learning applications requires a substantial amount of "black art" that is hard to find in textbooks. This article summarizes twelve key lessons that machine learning researchers and practitioners have learned. These include pitfalls to avoid, important issues to focus on, and answers to common questions.
1. INTRODUCTION
Machine learning systems automatically learn programs from data. This is often a very attractive alternative to manually constructing them, and in the last decade the use of machine learning has spread rapidly throughout computer science and beyond. Machine learning is used in Web search, spam filters, recommender systems, ad placement, credit scoring, fraud detection, stock trading, drug design, and many other applications. A recent report from the McKinsey Global Institute asserts that machine learning (a.k.a. data mining or predictive analytics) will be the driver of the next big wave of innovation [16]. Several fine textbooks are available to interested practitioners and researchers (e.g., [17, 25]). However, much of the "folk knowledge" that is needed to successfully develop machine learning applications is not readily available in them. As a result, many machine learning projects take much longer than necessary or wind up producing less-than-ideal results. Yet much of this folk knowledge is fairly easy to communicate. This is the purpose of this article.
Many different types of machine learning exist, but for illustration purposes I will focus on the most mature and widely used one: classification. Nevertheless, the issues I will discuss apply across all of machine learning. A classifier is a system that inputs (typically) a vector of discrete and/or continuous feature values and outputs a single discrete value, the class. For example, a spam filter classifies email messages into "spam" or "not spam," and its input may be a Boolean vector x = (x_1, ..., x_j, ..., x_d), where x_j = 1 if the j-th word in the dictionary appears in the email and x_j = 0 otherwise. A learner inputs a training set of examples (x_i, y_i), where x_i = (x_{i,1}, ..., x_{i,d}) is an observed input and y_i is the corresponding output, and outputs a classifier. The test of the learner is whether this classifier produces the correct output y_t for future examples x_t (e.g., whether the spam filter correctly classifies previously unseen emails as spam or not spam).
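To make this encoding concrete, here is a minimal Python sketch of the Boolean bag-of-words vector described above. The dictionary, the to_feature_vector helper, and the toy examples are illustrative assumptions, not code from the paper; tokenization is deliberately naive.

```python
# A minimal sketch of the Boolean feature vector described above.
# DICTIONARY and the examples are toy assumptions; tokenization is
# naive (whitespace splitting, no punctuation handling).

DICTIONARY = ["cheap", "meeting", "viagra", "report", "winner"]

def to_feature_vector(email_text):
    """Encode an email as x = (x_1, ..., x_d): x_j = 1 iff dictionary word j appears."""
    words = set(email_text.lower().split())
    return [1 if word in words else 0 for word in DICTIONARY]

# A training set of labeled examples (x_i, y_i) for a learner to consume.
training_set = [
    (to_feature_vector("cheap viagra for the lucky winner"), "spam"),
    (to_feature_vector("quarterly report before the meeting"), "not spam"),
]

print(to_feature_vector("cheap cheap winner"))  # [1, 0, 0, 0, 1]
```

The learner then maps such a training set to a classifier, and the real test is how that classifier does on emails it has never seen.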
2. LEARNING = REPRESENTATION + EVALUATION + OPTIMIZATION
Suppose you have an application that you think machine learning might be good for. The first problem facing you is the bewildering variety of learning algorithms available. Which one to use? There are literally thousands available, and hundreds more are published each year. The key to not getting lost in this huge space is to realize that it consists of combinations of just three components. The components, illustrated in a short sketch after this list, are:
Representation. A classifier must be represented in some formal language that the computer can handle. Conversely, choosing a representation for a learner is tantamount to choosing the set of classifiers that it can possibly learn. This set is called the hypothesis space of the learner. If a classifier is not in the hypothesis space, it cannot be learned. A related question, which we will address in a later section, is how to represent the input, i.e., what features to use.
Evaluation. An evaluation function (also called objective function or scoring function) is needed to distinguish good classifiers from bad ones. The evaluation function used internally by the algorithm may differ from the external one that we want the classifier to optimize, for ease of optimization (see below) and due to the issues discussed in the next section.
Optimization. Finally, we need a method to search among the classifiers in the language for the highest-scoring one. The choice of optimization technique is key to the efficiency of the learner, and also helps determine the classifier produced if the evaluation function has more than one optimum. It is common for new learners to start out using off-the-shelf optimizers, which are later replaced by custom-designed ones.
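As a rough illustration of this decomposition, the sketch below fills each slot with a deliberately simple choice: hyperplanes as the representation, training-set accuracy as the evaluation function, and naive random search as the optimizer. The function names (predict, accuracy, random_search) and the toy data are assumptions for illustration, not anything from the paper.

```python
# A minimal sketch of the three-component view, under toy assumptions:
# representation = hyperplanes (weights w, bias b), evaluation =
# training-set accuracy, optimization = naive random search.
import random

def predict(w, b, x):                   # REPRESENTATION: a hyperplane
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

def accuracy(w, b, examples):           # EVALUATION: fraction classified correctly
    return sum(predict(w, b, x) == y for x, y in examples) / len(examples)

def random_search(examples, dim, tries=1000):  # OPTIMIZATION: search the hypothesis space
    best, best_score = None, -1.0
    for _ in range(tries):
        w = [random.uniform(-1, 1) for _ in range(dim)]
        b = random.uniform(-1, 1)
        score = accuracy(w, b, examples)
        if score > best_score:
            best, best_score = (w, b), score
    return best

# Toy AND-like data: y = 1 only when both features are 1 (linearly separable).
data = [([0, 0], 0), ([0, 1], 0), ([1, 0], 0), ([1, 1], 1)]
w, b = random_search(data, dim=2)
print(accuracy(w, b, data))  # usually 1.0 on this toy set
```

The point of the decomposition is that each slot can be swapped independently: replacing random search with gradient descent, or accuracy with log-loss, yields a different learner without touching the other two components.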
Table 1 shows common examples of each of these three components. For example, k-nearest neighbor classifies a test example by finding the k most similar training examples and predicting the majority class among them. Hyperplane-based methods form a linear combination of the features per class and predict the class with the highest-valued combination. Decision trees test one feature at each internal node, with one branch for each feature value, and have class predictions at the leaves.
• First published in 2012 in CACM
• Recognizes that successful Machine Learning is something of a black art, communicated as folk knowledge
• Summarizes twelve key lessons about Machine Learning that should be internalized by students and practitioners alike
• Does not contain all the answers, but points you in the right direction and challenges assumptions
3. About The Author
• Pedro Domingos
– Professor at The University of Washington
– Prolific author and researcher in the fields of Machine Learning and Data Mining
– Author of The Master Algorithm
• His website contains a plethora of interesting material: http://homes.cs.washington.edu/~pedrod/
• The paper discussed in this talk can be found at: http://homes.cs.washington.edu/~pedrod/papers/cacm12.pdf
5. Key Lessons
• Learning = Representation + Evaluation + Optimization
• It's Generalization That Counts
• Data Alone Is Not Enough
• Overfitting Has Many Faces
• Intuition Fails In High Dimensions
• Theoretical Guarantees Are Not What They Seem
• Feature Engineering Is the Key
• More Data Beats a Cleverer Algorithm
• Learn Many Models, Not Just One
• Simplicity Does Not Imply Accuracy
• Representable Does Not Imply Learnable
• Correlation Does Not Imply Causation