Strategies for Practical Active Learning
Robert Munro, PhD
VP of Machine Learning, CrowdFlower
@WWRob
Open Data Science Conference #ODSC
November 3, 2017
My Background
Disaster Response/Recovery
Stanford PhD: NLP in Health and Disaster Response
Product for NLP at AWS’s Amazon AI
VP of ML at CrowdFlower: Annotation and Human-in-the-Loop ML
What is Active Learning?
Active Learning
What is Active Learning?
• Selecting the optimal data to manually label for Machine Learning
Why is it important?
• The right data can increase accuracy more than the algorithm
Why is it overlooked?
• Active Learning is everywhere in industry, but appears in <5% of academic papers
Active Learning
Selecting the optimal data to manually label for Machine Learning
Often a continuous feedback loop
“Please identify
pictures of cats,
like this one”
“Ok!”
“Are these cats?”
Why is Active Learning Important?
Human resources are limited. What is the right data to focus on?
“Please identify
pictures of cats,
like this one”
“Ok!”
“Are these cats?”
Why is Active Learning Overlooked?
Mentions in ACM papers for AI-related terms (http://dl.acm.org/):
Academia has largely ignored Active Learning
Background:
ImageNet and TensorFlow
ImageNet
~1 million images labeled with 1,000+ categories
The categories are from WordNet, a hierarchy of terms
Source: http://image-net.org/explore
TensorFlow:
an open-source Machine Learning library
We will use a pre-trained Deep Learning model for ImageNet
Deep Learning models for images are networks of ‘layers’, where each layer is a further refinement from raw pixels to the target label
Figure: Matthew Zeiler and Rob Fergus, ZF Net
TensorFlow’s ImageNet model
Example output:
[['canoe', 0.90240431], ['paddle, boat paddle', 0.042475685], ['gondola', 0.0011620093], ['sandbar, sand bar', 0.0011261732], ['snorkel', 0.00047367468]]
Predicting that this image is a ‘canoe’ with 90.2% confidence, a ‘paddle/boat paddle’ with 4.2% confidence, etc.
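As a rough sketch of how output like this is produced, assuming TensorFlow 2.x and a pre-trained InceptionV3 from tf.keras.applications (not the talk's own classify_image script):

```python
import numpy as np
import tensorflow as tf

# Pre-trained ImageNet classifier (downloads weights on first use)
model = tf.keras.applications.InceptionV3(weights="imagenet")

def predict_labels(image_path, top=5):
    # InceptionV3 expects 299x299 RGB input
    img = tf.keras.preprocessing.image.load_img(image_path, target_size=(299, 299))
    x = tf.keras.preprocessing.image.img_to_array(img)
    x = tf.keras.applications.inception_v3.preprocess_input(x[np.newaxis, ...])
    probs = model.predict(x, verbose=0)
    # decode_predictions maps class indices back to human-readable ImageNet labels
    top_preds = tf.keras.applications.inception_v3.decode_predictions(probs, top=top)[0]
    return [(label, float(score)) for (_, label, score) in top_preds]

print(predict_labels("canoe.jpg"))  # e.g. [('canoe', 0.90), ('paddle', 0.04), ...]
```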
Active Learning:
What should humans review to add new labels?
Starter code
“Active Learning with TensorFlow and ImageNet”
https://github.com/rmunro/active_learning_imagenet
Starter code and images for using Active Learning to apply ImageNet labels to new sports-related images
Ambiguous items
Example:
the top two predictions have 36.9% and 32.2% confidence
[['volleyball', 0.36908466], ['balance beam, beam', 0.32213417], ['stage', 0.020542733], ['basketball', 0.019910889], ['horizontal bar, high bar', 0.011983166]]
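A minimal sketch of selecting ambiguous items, assuming each image's predictions are stored as a confidence-sorted list of (label, confidence) pairs like the output above:

```python
def most_ambiguous(predictions, n=100):
    # predictions: {image_path: [(label, confidence), ...]} sorted by confidence.
    # The smallest gap between the top two confidences = the most ambiguous item.
    def margin(preds):
        return preds[0][1] - preds[1][1]
    return sorted(predictions, key=lambda path: margin(predictions[path]))[:n]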
Low confidence items
Example:
the top prediction has only 11.2% confidence
[['parachute, chute', 0.11202857], ['geyser', 0.075139046], ['wing', 0.074320331], ['cliff, drop, drop-off', 0.074191555], ['balloon', 0.053766355]]
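Under the same assumed prediction structure, least-confidence selection is one line:

```python
def least_confident(predictions, n=100):
    # The lower the top prediction's score, the more the model is guessing
    return sorted(predictions, key=lambda path: predictions[path][0][1])[:n]
```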
Randomly selected items
Evaluate accuracy on a random set of items
The most valuable items to label are confidently wrong
[['volleyball', 0.80830169], ['rugby ball', 0.029293904], ['bathing cap, swimming cap', 0.020639554], ['soccer ball', 0.020503236], ['bikini, two-piece', 0.011906843]]
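A sketch of drawing the random evaluation set (names here are illustrative, not from the talk's code):

```python
import random

def random_sample(unlabeled_paths, n=100, seed=42):
    # A random sample is the only unbiased way to measure accuracy,
    # and it surfaces items the model gets wrong with high confidence
    return random.Random(seed).sample(list(unlabeled_paths), n)
```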
Advanced Active Learning
Clustering (unsupervised or 2nd-to-last layer)
Select equal numbers from all clusters
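A sketch under assumptions the talk doesn't specify: `embeddings` is an (n_items, dim) array taken from the network's second-to-last layer, and scikit-learn's KMeans does the clustering.

```python
import random
from sklearn.cluster import KMeans

def sample_per_cluster(paths, embeddings, n_clusters=10, per_cluster=10, seed=42):
    # Cluster on the embedding, then draw an equal number from every cluster,
    # so the labeled set covers the data's variety
    cluster_ids = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(embeddings)
    rng = random.Random(seed)
    selected = []
    for c in range(n_clusters):
        members = [p for p, cid in zip(paths, cluster_ids) if cid == c]
        rng.shuffle(members)
        selected.extend(members[:per_cluster])
    return selected
```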
Advanced Active Learning
Using external resources:
e.g. WordNet distance between top predictions
[['lawn mower, mower', 0.44160703], ['crash helmet', 0.18804552], ['vacuum, vacuum cleaner', 0.038397752], ['go-kart', 0.03737054], ['motor scooter, scooter', 0.033097573]]
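A sketch of scoring that distance with NLTK's WordNet interface (an assumption; ImageNet classes are really keyed by WordNet IDs, so the string lookup here is a simplification):

```python
from nltk.corpus import wordnet as wn  # requires the WordNet corpus downloaded

def top_two_wordnet_distance(preds):
    # Look up each label's first synset; e.g. 'lawn mower, mower' -> 'lawn_mower'
    synsets = [wn.synsets(label.split(',')[0].replace(' ', '_'))
               for label, _ in preds[:2]]
    if not all(synsets):
        return None  # a label was not found in WordNet
    similarity = synsets[0][0].path_similarity(synsets[1][0])
    return None if similarity is None else 1.0 - similarity  # higher = further apart
```

Confident predictions that are semantically far apart (a lawn mower vs a crash helmet) suggest genuine model confusion worth a human look.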
Active Learning Exceptions
What if low confidence items are not spread across all classes?
E.g.: Squash or Racquetball?
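If not, two confusable classes can soak up the whole annotation budget. One mitigation, sketched with an illustrative per-class cap over the same assumed prediction structure:

```python
from collections import defaultdict

def least_confident_with_cap(predictions, n=100, per_class_cap=10):
    # Rank by least confidence, but let no single predicted class
    # contribute more than `per_class_cap` items to the batch
    ranked = sorted(predictions, key=lambda path: predictions[path][0][1])
    counts, selected = defaultdict(int), []
    for path in ranked:
        top_label = predictions[path][0][0]
        if counts[top_label] < per_class_cap:
            counts[top_label] += 1
            selected.append(path)
        if len(selected) == n:
            break
    return selected
```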
Active Learning Exceptions
What if you care about some types of labels more than others?
Over-sample the labels you care about.
Use clustering or external resources.
Be careful about introducing bias!
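One possible way to over-sample, sketched under the same assumptions (the 75% share is an illustrative choice, not from the talk):

```python
def prioritized_sample(predictions, priority_labels, n=100, priority_share=0.75):
    # Reserve most of the batch for items predicted as a priority label,
    # but keep some of everything else to limit the bias
    ranked = sorted(predictions, key=lambda path: predictions[path][0][1])
    hits = [p for p in ranked if predictions[p][0][0] in priority_labels]
    rest = [p for p in ranked if predictions[p][0][0] not in priority_labels]
    k = int(n * priority_share)
    return hits[:k] + rest[:n - k]
```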
Interface Design for Annotations
Trade-offs:
• More repetitive work is faster but more error-prone due to boredom
• Less repetitive work is slower but more accurate
Workflows:
• What is the best interface to get unbiased data for evaluation?
• What is the fastest interface to get human verification on confident model predictions?
For starter code on interfaces, see:
https://github.com/rmunro/annotation_imagenet
Getting Human Judgments
In reality, you can choose:
– Crowdsourced workers
– Trained contractors
– Business Process Outsourcers
– Your own in-house annotators
– Some combination of the above
Ensuring Quality Annotations
1. Embed ‘gold’ (known) answers to quiz workers and track accuracy
2. Select the right annotators for the job
3. Give the same job to multiple people, and track agreement
4. Break up complex tasks into simpler ones
5. Remove ordering effects and ‘priming’
6. Subjective task? Use Bayesian truth serum
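A sketch of checks 1 and 3 from the list above; the data structures are assumptions, not the talk's code:

```python
from collections import Counter

def gold_accuracy(annotations, gold):
    # annotations: {item_id: label} from one worker; gold: {item_id: label}
    # of embedded known answers. Returns the worker's accuracy on gold items.
    scored = [item for item in annotations if item in gold]
    if not scored:
        return None
    return sum(annotations[i] == gold[i] for i in scored) / len(scored)

def majority_label(labels):
    # labels: one item's labels from several workers;
    # returns the majority label and the agreement ratio
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)
```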
Getting started
Annotate ~10% of new data randomly. This is your baseline.
Use random, held-out data for accuracy:
micro-F1, macro-F1, ROC, entropy / information gain
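For the F-scores, a sketch using scikit-learn (an assumption; the talk doesn't name a library):

```python
from sklearn.metrics import f1_score

def evaluate(y_true, y_pred):
    # micro-F1 weights every item equally; macro-F1 weights every class equally
    return {
        "micro_f1": f1_score(y_true, y_pred, average="micro"),
        "macro_f1": f1_score(y_true, y_pred, average="macro"),
    }
```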
Getting started
Annotate ~90% of new data using ambiguous or low-confidence items. Compare the 10% subset to the baseline.
More accurate? Continue!
Getting started
Annotate ~90% of new data using ambiguous or low-confidence items. Compare the 10% subset to the baseline.
Less accurate? Try more advanced strategies!
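Putting the two steps together, a sketch of one selection batch, reusing the illustrative `random_sample` and `least_confident` helpers from earlier (all assumptions, not the talk's code):

```python
def select_batch(predictions, batch_size=1000):
    # ~10% random (keeps the unbiased baseline growing),
    # ~90% chosen by active learning
    n_random = batch_size // 10
    batch = set(random_sample(predictions.keys(), n=n_random))
    remaining = {p: v for p, v in predictions.items() if p not in batch}
    return list(batch) + least_confident(remaining, n=batch_size - n_random)
```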
Getting started
Does accuracy start to plateau or decline relative to the baseline? You might be biased towards a subset:
Increase the % of randomly selected items
Still not working? Look into clustering and other methods to ensure data variety
Summary
What is Active Learning?
• Selecting the optimal data to manually label for Machine Learning
• You now know how to do this!
Why is it important?
• The right data can increase accuracy more than the algorithm
• Test this for yourself!
Why is it overlooked?
• Active Learning is everywhere in industry, but appears in <5% of academic papers
• Please share your results!
Thank You
