Machine Learning - Black Art
Charles Parker
Allston Trading
Machine Learning is Hard!
• By now, you know kind of a lot

• Different types of models

• Feature engineering

• Ways to evaluate

• But you’ll still fail!

• Out in the real world, there are a whole bunch of things that will kill your project

• FYI - A lot of these talks are stolen
Join Me!
• On a journey into the Machine Learning House of
Horrors!

• Mwa ha ha!
The Machine Learning House of Horrors!
• The Horror of The Huge Hypothesis Space

• The Perils of The Poorly Picked Loss Function

• The Creeping Creature Called Cross Validation

• The Dread of the Drifting Domain

• The Repugnance of Reliance on Research Results
Choosing A Hypothesis Space
• By “hypothesis space” we
mean the possible classifiers
you could build with an
algorithm given the data

• This is the choice you make
when you pick a learning
algorithm

• You have one job!

• Is there any way to make it
easier?
Theory to The Rescue!
• Probably Approximately Correct

• We’d like our model to have error less than epsilon

• We’d like that to happen at least some percentage of the time

• If the error is epsilon, the percentage is sigma, the number of
training examples is m, and the hypothesis space size is d:
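The bound itself was a formula image on the original slide. A standard PAC sample-complexity result consistent with these quantities (for a finite hypothesis space of size d and a consistent learner, with the confidence parameter, called sigma above, written here as \( \delta \)) is:

\[ m \ \ge\ \frac{1}{\epsilon}\left(\ln d + \ln\frac{1}{\delta}\right) \]

That is, with probability at least \( 1 - \delta \), any hypothesis consistent with m such examples has error below \( \epsilon \); the required m grows with the log of the hypothesis space size.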
The Triple Trade-Off
• There is a triple trade-off between the error, the size of the hypothesis space, and the amount of training data you have
[Figure: a triangle linking Error, Hypothesis Space, and Training Data]
What About Huge Data?
• I’m clever, so I’ll use non-parametric methods (decision trees, k-NN, kernelized SVMs)

• As data scales, curious things
tend to happen

• Simpler models become more
desirable as they’re faster to fit.

• You can increase model complexity by adding features (maybe word counts); see the sketch below

• Big data often trumps modeling!
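A minimal sketch of that feature-adding idea, using scikit-learn (the library and the toy corpus are my own assumptions, not from the talk): the model stays a cheap linear classifier, and its effective complexity grows simply by feeding it word-count features.

```python
# Sketch: a simple, fast-to-fit linear model whose capacity grows via added features.
# scikit-learn and the toy spam corpus are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

docs = ["free money now", "meeting at noon", "win a free prize", "lunch at noon?"]
labels = [1, 0, 1, 0]  # 1 = spam, 0 = not spam

# CountVectorizer turns each document into word counts; the vocabulary (and hence
# the feature count) grows with the data, while the linear model stays simple.
model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(docs, labels)
print(model.predict(["free prize at noon"]))
```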
The Machine Learning House of Horrors!
• The Horror of The Huge Hypothesis Space

• The Perils of The Poorly Picked Loss Function

• The Creeping Creature Called Cross Validation

• The Dread of the Drifting Domain

• The Repugnance of Reliance on Research Results
A Dirty Little Secret About ML Algorithms
• They don’t care what you want; each one optimizes its own built-in objective (see the reconstruction below)

• Decision Trees:

• SVM:

• LR:

• LDA:
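The formulas that followed each algorithm on the original slide were images and aren’t reproduced above. As a hedged reconstruction, the built-in objectives these methods typically optimize are roughly:

• Decision Trees: pick the split maximizing information gain, \( \Delta I = H(S) - \sum_{v} \frac{|S_v|}{|S|}\, H(S_v) \) (or the Gini analogue)

• SVM: minimize hinge loss plus a margin penalty, \( \tfrac{1}{2}\|w\|^2 + C \sum_i \max\!\left(0,\ 1 - y_i\, w^\top x_i\right) \)

• LR: minimize the log loss, \( \sum_i \log\!\left(1 + e^{-y_i\, w^\top x_i}\right) \)

• LDA (read here as linear discriminant analysis): maximize the Fisher ratio \( \frac{w^\top S_B\, w}{w^\top S_W\, w} \)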
Real-world Losses
• Real losses are nothing like this

• False positive in disease
diagnosis

• False positive in face
detection

• False positive in thumbprint
identification

• Some aren’t even instance-
based

• Path dependencies

• Game playing
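One way to make an asymmetric loss concrete (my own illustration; the cost numbers and labels are assumptions, not from the talk) is to score a classifier against a cost matrix rather than plain accuracy, e.g. charging a missed disease case far more than a false alarm:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical disease screening: rows = true class, columns = predicted class.
# Missing a sick patient (false negative) is assumed to cost 50x a false alarm.
cost = np.array([[0, 1],     # [TN cost, FP cost]
                 [50, 0]])   # [FN cost, TP cost]

y_true = [0, 0, 1, 1, 0, 1, 0, 0]
y_pred = [0, 1, 1, 0, 0, 1, 0, 0]

cm = confusion_matrix(y_true, y_pred, labels=[0, 1])
print("total cost:", (cm * cost).sum())        # what you actually care about
print("accuracy:  ", np.trace(cm) / cm.sum())  # what most algorithms report
```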
Specializing Your Loss
• One solution is to let developers apply their own loss

• This is the approach of SVM light (http://svmlight.joachims.org/); it’s been around for a while

• Losses other than Mutual Information can be plugged into the appropriate
place in splitting code

• Models trained via gradient descent can obviously be customized (Python’s Theano is interesting for this); a minimal sketch follows below

• For multi-example loss functions, there’s SEARN in Vowpal Wabbit:

https://github.com/JohnLangford/vowpal_wabbit
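A minimal sketch of the gradient-descent route, in plain NumPy rather than Theano (the asymmetric weighting and toy data are my own assumptions): weight the log loss per class so that false positives cost more than false negatives, then descend the gradient of that weighted loss.

```python
import numpy as np

# Toy data (illustrative): X is (n, d), y is 0/1.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

w_fp, w_fn = 5.0, 1.0  # assumed costs: false positives hurt 5x more

def grad(theta):
    p = 1.0 / (1.0 + np.exp(-X @ theta))       # predicted P(y = 1)
    weights = np.where(y == 1, w_fn, w_fp)     # per-example cost weight
    return X.T @ (weights * (p - y)) / len(y)  # gradient of the weighted log loss

theta = np.zeros(X.shape[1])
for _ in range(500):                           # plain batch gradient descent
    theta -= 0.1 * grad(theta)

print("learned weights:", theta)
```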
Other Hackery
• Sometimes, the solution is just to hack
around the actual prediction

• Have several levels (a cascade) of classifiers, as in, e.g., medical diagnosis or text recognition

• Apply logic to explicitly avoid high loss
cases (e.g., when buying/selling equities)

• Changing the problem setting

• Will you be doing queries? Use ranking
or metric learning

• If you find yourself thinking “I want to do crazy thing x with classifiers”, chances are it’s already been done and you can read about it.
The Machine Learning House of Horrors!
• The Horror of The Huge Hypothesis Space

• The Perils of The Poorly Picked Loss Function

• The Creeping Creature Called Cross Validation

• The Dread of the Drifting Domain

• The Repugnance of Reliance on Research Results
When Validation Attacks!
• Cross validation

• n-Fold - Hold out one fold for
testing, train on n - 1 folds

• Great way to measure
performance, right?

• It’s all about information leakage

• via instances

• via features
Case Study #1: Law of Averages
• Estimate sporting event
outcomes

• Use previous games to
estimate points scored for
each team (via windowing
transform)

• Choose winner based on
predicted score

• What if you’re off by one on the window? (See the sketch below.)
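A minimal sketch of the off-by-one trap (pandas and the toy scores are my own assumptions): the windowed feature must average games strictly before the one being predicted, which is what the shift(1) buys you; drop it and the outcome leaks into its own feature.

```python
import pandas as pd

# Hypothetical points scored by one team in consecutive games.
scores = pd.Series([21, 28, 14, 35, 17, 24], name="points")

# Correct: average of the PREVIOUS 3 games only (shift before rolling).
prev_avg = scores.shift(1).rolling(3).mean()

# Off by one: the window includes the very game we're trying to predict.
leaky_avg = scores.rolling(3).mean()

print(pd.DataFrame({"points": scores, "prev_avg": prev_avg, "leaky_avg": leaky_avg}))
```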
Case Study #2: Photo Dating
• Take scanned photos from
30 different users (on
average 200 per user) and
create a model to assign a
date taken (plus or minus
five years)

• Perform 10-fold cross-validation

• Accuracy is 85%. Can you trust it? (A grouped-CV sketch follows below.)
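The leak here is via instances: photos from the same user (same camera, same era) end up on both sides of a naive split, so the model partly memorizes users rather than dating photos. A minimal sketch, assuming scikit-learn and made-up arrays, of keeping every user’s photos in a single fold:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical setup: one row per scanned photo, `groups` holds the user id.
X = np.random.rand(600, 16)             # 600 photos, 16 features each
y = np.random.randint(1960, 2000, 600)  # year the photo was taken
groups = np.repeat(np.arange(30), 20)   # 30 users, 20 photos apiece

# GroupKFold keeps all of a user's photos in one fold, unlike plain 10-fold CV.
cv = GroupKFold(n_splits=10)
for train_idx, test_idx in cv.split(X, y, groups=groups):
    assert set(groups[train_idx]).isdisjoint(groups[test_idx])
```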
Case Study #3: Moments In Time
• You have a buy/sell
opportunity every five
seconds

• The signals you use to
evaluate the opportunity
are aggregates of market
activity over the last five
minutes

• How careful must you be with cross-validation? (A time-aware split sketch follows below.)
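Here the leak is via features: opportunities five seconds apart share almost the same five-minute aggregates, so a shuffled split puts near-copies of every test row in the training set. A minimal sketch, assuming scikit-learn (0.24+) and one row every five seconds, of an ordered split with a gap at least as long as the feature window:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# One opportunity every 5 seconds; features aggregate the last 5 minutes,
# so each row overlaps the previous 60 rows of raw market activity.
rows_per_window = 5 * 60 // 5   # = 60

X = np.random.rand(10_000, 8)   # hypothetical feature matrix, in time order

# Ordered splits with a gap of one full window between train and test, so no
# training row shares underlying market activity with any test row.
cv = TimeSeriesSplit(n_splits=5, gap=rows_per_window)
for train_idx, test_idx in cv.split(X):
    assert train_idx.max() + rows_per_window < test_idx.min()
```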
The Machine Learning House of Horrors!
• The Horror of The Huge Hypothesis Space

• The Perils of The Poorly Picked Loss Function

• The Creeping Creature Called Cross Validation

• The Dread of the Drifting Domain

• The Repugnance of Reliance on Research Results
Breaking Machine Learning
• You’ve got this great model!
Congratulations!

• Suddenly it stops working.
Why?

• You might be in a domain
that tends to change over
time (document classification,
sales prediction)

• You might be experiencing
adverse selection (market
data predictions, spam)
Concept Drift
• This is called non-stationarity in either the prior or the conditional
distributions

• Could be a couple of different things

• If the prior p(input) is changing, it’s covariate shift

• If the conditional p(output | input) is changing, it’s concept drift

• No rule that it can’t be both

• http://blog.bigml.com/2013/03/12/machine-learning-from-streaming-data-two-problems-two-solutions-two-concerns-and-two-lessons/
Take Action!
• First: Look for symptoms

• Getting a lot of errors

• The distribution of predicted values changes

• Drift detection algorithms (that I know about) have the same basic flavor:

• Buffer some data in memory

• If recent data is “different” from past data, retrain, update or give up

• Some resources - A nice survey paper and an open source package:
http://www.win.tue.nl/~mpechen/publications/pubs/Gama_ACMCS_AdaptationCD_accepted.pdf

http://moa.cms.waikato.ac.nz/
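A minimal sketch of that buffer-and-compare flavor (the window size and threshold are my own assumptions, not taken from any particular package): keep a rolling buffer of recent prediction errors and flag drift when the recent error rate pulls clearly away from the long-run rate.

```python
from collections import deque

class SimpleDriftMonitor:
    """Toy illustration of the buffer-and-compare flavor described above.

    Not a reimplementation of any published detector; DDM, ADWIN and friends
    (e.g. in MOA) do this with proper statistics.
    """

    def __init__(self, window=200, margin=0.10):
        self.recent = deque(maxlen=window)  # rolling buffer of recent outcomes
        self.errors = 0                     # long-run error count
        self.total = 0
        self.margin = margin

    def add(self, was_error: bool) -> bool:
        """Record one prediction outcome; return True if drift is suspected."""
        self.recent.append(was_error)
        self.errors += was_error
        self.total += 1
        long_run = self.errors / self.total
        recent_rate = sum(self.recent) / len(self.recent)
        return (len(self.recent) == self.recent.maxlen
                and recent_rate > long_run + self.margin)

# Usage: call monitor.add(pred != truth) after each prediction;
# retrain, update, or give up when it returns True.
```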
The Benefits of Archeology
• Why might you train on old
data, even if it’s not relevant?

• Verification of your research
process

• You’d have done the same thing last year. Did it work?

• Gives you a good idea of
how much drift you should
expect
The Machine Learning House of Horrors!
• The Horror of The Huge Hypothesis Space

• The Perils of The Poorly Picked Loss Function

• The Creeping Creature Called Cross Validation

• The Dread of the Drifting Domain

• The Repugnance of Reliance on Research Results
Publish or Perish
• Academic papers are a certain type of
result

• Show incremental improvement in
accuracy or generality

• Prove something about your
algorithm

• The latter is hard to come by as results get more realistic

• Machine learning proofs assume data is i.i.d., but this is obviously false.

• Real-world data sucks, and dealing with that significantly changes the dataset
Usefulness of Results
• Theoretical Results

• Most of the time, the bounds do not apply (error, sample complexity, convergence)

• Sometimes they don’t even make any sense

• Beware of putting too much faith in a single person or a single piece of work

• Usefulness generally occurs only in the aggregate

• And sometimes not even then (researchers are people, too)
Machine Learning Isn’t About Machine Learning
• Why doesn’t it work like in the
paper?

• Remember, the paper is carefully
controlled in a way your application
is not.

• Performance is rarely driven by
machine learning

• It’s driven by cameras and microphones

• It’s driven by Mario Draghi
So, Don’t Bother With It?
• Of course not!

• What’s the alternative?

• “All our science, measured
against reality, is primitive
and childlike — and yet it is
the most precious thing we
have” - Albert Einstein

• Use academia as your
starting point, but don’t
think it will get you out of
the work
Some Themes
• The major points of this talk:

• Machine learning is hard to get right

• The algorithms won’t do what you want

• Good results are probably spurious

• Even if they aren’t, it won’t last

• Reading the research won’t help

• Wait, no!

• Have an attitude of skeptical optimism (or optimal skepticism?)
